Data Mining Demo: Modelling and forecasting with R and EXASOL

This post walks through a simple data mining demo that shows modelling and forecasting with R connected to Exasol. If you have not yet installed and configured the necessary components, you can find instructions here.

# Load the required packages
library(RODBC)
library(exasol)
library(rredis)
library(magrittr)
library(stringi)
library(rpart)
library(partykit)

# Connect to Exasol and Redis
con <- odbcConnect("exasol_vm")
redisConnect("172.20.248.13")

# Create a vector of random numbers for the sampling
rnd <- rnorm(nrow(iris))

# Add the group variable (training/validation)
iris$groups <- factor(NA, levels = c("Train", "Valid"))

# Random draw stratified by species: 70% training, 30% validation
for (i in unique(iris$Species)) {
  logVec <- iris$Species == i
  iris$groups[logVec] <- ifelse(test = rnd[logVec] > quantile(rnd[logVec],
                                                              probs = 0.3),
                                yes = "Train",
                                no = "Valid")
}

# Check the sampling
table(iris$groups, iris$Species)

# Clean up the workspace
rm(rnd, logVec, i)
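
To see why the quantile threshold produces the stated 70/30 split within each species, the sampling step can be reproduced on its own, without any database connection. The following is a minimal sketch in base R only; the seed value and the names `dat`, `cutoff` and `split_tab` are illustrative choices that do not appear in the demo itself.

```r
# Stand-alone sketch of the stratified 70/30 split (base R only).
# The seed is an arbitrary assumption, added for reproducibility.
set.seed(42)

dat <- iris
rnd <- rnorm(nrow(dat))
dat$groups <- factor(NA, levels = c("Train", "Valid"))

for (sp in unique(dat$Species)) {
  in_species <- dat$Species == sp
  # Rows whose random number lies above the species-wise 30% quantile
  # go to training, the remaining rows go to validation.
  cutoff <- quantile(rnd[in_species], probs = 0.3)
  dat$groups[in_species] <- ifelse(rnd[in_species] > cutoff, "Train", "Valid")
}

# Cross-tabulate group membership against species
split_tab <- table(dat$groups, dat$Species)
print(split_tab)
```

Since iris contains 50 flowers per species, each species should end up with 35 training and 15 validation rows regardless of the seed, because the 30% quantile falls between the 15th and 16th smallest random number of each group.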

The iris data are now transferred to Exasol. On first use, a database schema and an empty table are created; the iris data are then written into the table. Once uploaded, the data can be reused as often as needed.

# Create a database schema named my_schema
odbcQuery(con, "create schema my_schema")

# Create the empty table irisdb
odbcQuery(con, "create or replace table my_schema.irisdb(
          SepalLength DOUBLE,
          SepalWidth DOUBLE,
          PetalLength DOUBLE,
          PetalWidth DOUBLE,
          Species CHAR(20),
          Groups CHAR(20));")

# Upload the iris data to Exasol
exa.writeData(con, data = iris, tableName = "my_schema.irisdb")

In the first step of the analysis, a decision tree is built locally on the training data. This tree is then used to produce a forecast for the validation data.

# Create a training and a validation data set
train <- subset(iris, subset = groups == "Train", select = -groups)
valid <- subset(iris, subset = groups == "Valid", select = -groups)

# Build the decision tree on the training data
localTree <- rpart(Species ~ ., data = train)

# Visualise the tree
plot(as.party(localTree))

# Forecast the validation data using the tree
pred <- predict(localTree, type = "class", newdata = valid)

# Check the forecast
table(pred, valid$Species, dnn = c("Vorhersage", "Tatsächlich")) %>%
  addmargins()
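
The cross table of forecast versus actual species can be condensed into a single accuracy figure. The sketch below is a self-contained illustration, not part of the original demo: it swaps the random split for a simple deterministic one (every third row is held out) so that it runs on its own, and the names `valid_idx`, `conf_mat` and `accuracy` are introduced here purely for illustration.

```r
library(rpart)

# Illustrative split, NOT the demo's random split: every third row is held out.
valid_idx <- seq(3, nrow(iris), by = 3)
train <- iris[-valid_idx, ]
valid <- iris[valid_idx, ]

# Fit the tree and predict the class of the held-out rows
tree <- rpart(Species ~ ., data = train)
pred <- predict(tree, newdata = valid, type = "class")

# Confusion matrix: rows = predicted class, columns = actual class
conf_mat <- table(Predicted = pred, Actual = valid$Species)

# Share of correctly classified rows = sum of the diagonal / total count
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

The same two lines for `conf_mat` and `accuracy` could be applied to the `pred` and `valid` objects of the demo.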

The same procedure is now carried out inside Exasol. The exa.createScript function creates an R script on the Exasol server, and calling the function it returns runs that script on the Exasol cluster. The analysis therefore no longer takes place in the local R session but on the R instances in the Exasol cluster. Packages used on those R instances must be installed there. Have a look at this blog entry for more information.

The model built in the Exasol cluster is stored in Redis, a key-value database. Redis makes it possible to distribute models, functions and other R objects across the cluster and to load them again from within the cluster.

exa_rf <- exa.createScript(
  con,
  "my_schema.exa_rf", # The R script is available via SQL under this name
  function(data) {

    # Load the required packages; they may need to be installed in Exasol
    require(rpart)
    require(stringi)
    require(rredis)

    # Connect to Redis
    redisConnect("172.20.248.13", port = 6379)

    # Load all data from the Exasol table.
    # If the groupBy or where argument is used in the function call,
    # only the corresponding part of the data is loaded.
    data$next_row(NA)

    # Convert the data object into a data.frame
    df <- data.frame(v1 = data$SepalLength,
                     v2 = data$SepalWidth,
                     v3 = data$PetalLength,
                     v4 = data$PetalWidth,
                     species = data$Species)

    # Prepare the data.frame
    df$species <- stri_replace_all_fixed(df$species, " ", "")
    df$species <- as.factor(df$species)

    # Build the tree
    rf <- rpart(species ~ ., data = df)

    # Store the tree in Redis
    redisSet("exa_rf", rf)

    # Return the number of rows (as a check)
    data$emit(nrow(df))
  },
  inArgs = c("SepalLength DOUBLE",
             "SepalWidth DOUBLE",
             "PetalLength DOUBLE",
             "PetalWidth DOUBLE",
             "Species CHAR(20)"),
  outArgs = "Feedback INT")

# Call the function created above. The where argument ensures that
# the model is built on the training data.
exa_rf("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species",
       table = "my_schema.irisdb",
       where = "groups = 'Train'")
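
Storing a fitted model in a key-value store relies on the R object surviving a round trip through a byte representation; to my understanding, rredis serialises R values before sending them to Redis. The sketch below mimics that round trip without a Redis server. An R environment stands in for the store, and `kv_store`, `kv_set` and `kv_get` are hypothetical helpers written for this illustration, not rredis functions.

```r
library(rpart)

# Hypothetical in-memory stand-in for a key-value store (not part of rredis)
kv_store <- new.env()

kv_set <- function(key, value) {
  # Keep the object as a raw byte vector, as a key-value store would
  assign(key, serialize(value, connection = NULL), envir = kv_store)
}

kv_get <- function(key) {
  # Turn the stored bytes back into an R object
  unserialize(get(key, envir = kv_store))
}

# Fit a tree, store it under a key, and load it back
tree <- rpart(Species ~ ., data = iris)
kv_set("exa_rf", tree)
restored <- kv_get("exa_rf")

# The restored model should give exactly the same predictions
same_pred <- identical(predict(tree, iris, type = "class"),
                       predict(restored, iris, type = "class"))
print(same_pred)
```

In the demo itself, redisSet and redisGet play the role of these helpers, with the difference that the bytes live on the Redis server and are therefore reachable from every R instance in the cluster.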

The tree model can then be used for forecasting in a separate step on Exasol. The first part of the script is largely identical to the function above.

exa_predict_rf <- exa.createScript(
  con,
  "my_schema.exa_pred",
  function(data) {

    require(rpart)
    require(rredis)

    redisConnect("172.20.248.13", port = 6379)

    data$next_row(NA)

    df <- data.frame(v1 = data$SepalLength,
                     v2 = data$SepalWidth,
                     v3 = data$PetalLength,
                     v4 = data$PetalWidth,
                     species = data$Species)

    # Load the tree model from Redis
    rf <- redisGet("exa_rf")

    # Create the forecast
    pred <- predict(rf, newdata = df, type = "class")

    # Return the forecast together with the true class membership
    data$emit(pred, df$species)
  },
  inArgs = c("SepalLength DOUBLE",
             "SepalWidth DOUBLE",
             "PetalLength DOUBLE",
             "PetalWidth DOUBLE",
             "Species CHAR(20)"),
  outArgs = c("Prognose CHAR(20)",
              "Realwerte CHAR(20)"))

# Call the function created above. The result is stored in an object.
exa_pred <- exa_predict_rf("SepalLength", "SepalWidth", "PetalLength",
                           "PetalWidth", "Species",
                           table = "my_schema.irisdb",
                           where = "groups = 'Valid'")

# The table function shows how well the forecast performs.
table(exa_pred$PROGNOSE, exa_pred$REALWERTE, dnn = c("Prognose", "Realwerte")) %>%
  addmargins()
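
Because the output columns are declared as CHAR(20), the returned strings may come back padded with trailing spaces; the training script presumably strips spaces from Species for this reason. The sketch below shows how such values could be trimmed before they are compared with plain labels. The data frame `exa_pred_mock` is made-up illustration data, not output of the demo, and the padding behaviour is an assumption about fixed-length CHAR columns.

```r
# Made-up stand-in for the result of exa_predict_rf(); values are padded
# to 20 characters the way a fixed-length CHAR(20) column would hold them.
exa_pred_mock <- data.frame(
  PROGNOSE  = sprintf("%-20s", c("setosa", "versicolor", "virginica", "versicolor")),
  REALWERTE = sprintf("%-20s", c("setosa", "versicolor", "virginica", "virginica")),
  stringsAsFactors = FALSE
)

# Padded values do not match the plain species labels ...
padded_match <- exa_pred_mock$REALWERTE %in% levels(iris$Species)

# ... so trailing spaces are removed before any comparison
prognose  <- trimws(exa_pred_mock$PROGNOSE, which = "right")
realwerte <- trimws(exa_pred_mock$REALWERTE, which = "right")
trimmed_match <- realwerte %in% levels(iris$Species)

# Share of rows where forecast and true class agree
hit_rate <- mean(prognose == realwerte)
print(hit_rate)
```

Trimming matters mainly when the returned values are joined or compared with data that did not pass through a CHAR column, for example the local `valid$Species` factor.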
