R is an open source tool and contribution is made from various fields. In many ways, this can have positive influences but there is a concern that its syntax or API can be quite non-standardized . I’ve looked for a way that various models can be applied consistently for a while and the following two packages are found: mlr and caret .

Among these, I’m more interested in mlr as its objects that are related to model building seem to be structured more clearly and strictly, which can be additional benefit while analysis gets complicated.

According to the tutorial , typical analysis would be performed as following.

Imputation, Processing …
Task
Learner
Train
Predict
Performance
Resampling
Benchmark
In this article, steps 1 to 5 are briefly introduced by reiterating Chapter 4 Lab of ISLR - i.e. logistic regression is fit on Smarket data using stats and mlr packages.

In order to fit logistic regression using mlr , the development version (v2.3) has to be installed as `classif.binomial`

is yet to be added into the stable version (v2.2) - for further details, see here . Note that Rtools should be installed to build the packages on Windows.

# dev install
devtools :: install_github ( "berndbischl/ParamHelpers" )
devtools :: install_github ( "berndbischl/BBmisc" )
devtools :: install_github ( "berndbischl/parallelMap" )
devtools :: install_github ( "berndbischl/mlr" )

The following packages are loaded.

library ( ISLR )
library ( mlr )

The structure of the data (Smarket ) is shown below - Year and Today are going to be excluded.

str ( Smarket )

```
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
```

Test on training data
ISLR
Matching objects for Task and Learner steps don’t exist on stats package.

3. Train
glm.mod = glm ( Direction ~ . , data = subset ( Smarket , select = c ( - Year , - Today )), family = binomial )
summary ( glm.mod )

```
##
## Call:
## glm(formula = Direction ~ ., family = binomial, data = subset(Smarket,
## select = c(-Year, -Today)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
```

# coef(glm.mod) # also summary(glm.mod)$coef[,4]

4. Predict
# type = c("link","response","terms"), default - "link"
glm.probs = predict ( glm.mod , type = "response" )
head ( glm.probs ) # check positive level (1): contrasts(Smarket$Direction)

```
## 1 2 3 4 5 6
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565
```

glm.pred = sapply ( glm.probs , function ( p ) { ifelse ( p > .5 , "Up" , "Down" ) })
head ( glm.pred )

```
## 1 2 3 4 5 6
## "Up" "Down" "Down" "Up" "Up" "Up"
```

glm.cm = table ( data.frame ( response = glm.pred , truth = Smarket $ Direction ))
glm.cm # confusion matrix

```
## truth
## response Down Up
## Down 145 141
## Up 457 507
```

# mean misclassificaton error rate 47.84%
glm.mmse = 1 - sum ( diag ( glm.cm )) / sum ( glm.cm )
glm.mmse

`## [1] 0.4784`

MLR
1. Task
mlr.glm.task = makeClassifTask ( id = "Smarket" , data = subset ( Smarket , select = c ( - Year , - Today )), positive = "Up" , target = "Direction" )
mlr.glm.task

```
## Supervised task: Smarket
## Type: classif
## Target: Direction
## Observations: 1250
## Features:
## numerics factors ordered
## 6 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Classes: 2
## Down Up
## 602 648
## Positive class: Up
```

2. Learner
mlr.glm.lrn = makeLearner ( "classif.binomial" , predict.type = "prob" )
mlr.glm.lrn

```
## Learner classif.binomial from package stats
## Type: classif
## Name: Binomial Regression; Short name: binomial
## Class: classif.binomial
## Properties: twoclass,numerics,factors,prob,weights
## Predict-Type: prob
## Hyperparameters:
```

3. Train
mlr.glm.mod = train ( mlr.glm.lrn , mlr.glm.task )
mlr.glm.mod

```
## Model for id = classif.binomial class = classif.binomial
## Trained on obs: 1250
## Used features: 6
## Hyperparameters:
```

4. Predict
## prediction - model + either task or newdata but not both
mlr.glm.pred = predict ( object = mlr.glm.mod , task = mlr.glm.task )
mlr.glm.pred

```
## Prediction: 1250 observations
## predict.type: prob
## threshold: Down=0.50,Up=0.50
## time: 0.00
## id truth prob.Down prob.Up response
## 1 1 Up 0.4929159 0.5070841 Up
## 2 2 Up 0.5185321 0.4814679 Down
## 3 3 Down 0.5188612 0.4811388 Down
## 4 4 Up 0.4847776 0.5152224 Up
## 5 5 Up 0.4892188 0.5107812 Up
## 6 6 Up 0.4930435 0.5069565 Up
```

mlr.glm.cm = table ( subset ( mlr.glm.pred $ data , select = c ( "response" , "truth" )))
mlr.glm.cm

```
## truth
## response Down Up
## Down 145 141
## Up 457 507
```

mlr.glm.mmse = 1 - sum ( diag ( mlr.glm.cm )) / sum ( mlr.glm.cm )
mlr.glm.mmse

`## [1] 0.4784`

# or simply
mlr.glm.mmse = performance ( mlr.glm.pred ) # default: mean misclassification error (mmse)
mlr.glm.mmse

```
## mmce
## 0.4784
```

# see available measures - listMeasures(task.name)

So far the model is tested on the training data and the resulting test error (mean misclassifcation error) is likely to be overly optimistic. Below the model is tested on an independent data to obtain a more reliable test error rate.

Test on independent data
Smarket.train = subset ( Smarket , Year != 2005 , select = c ( - Year , - Today ))
Smarket.test = subset ( Smarket , Year == 2005 , select = c ( - Year , - Today ))

ISLR
3. Train
glm.mod.new = glm ( Direction ~ . , data = Smarket.train , family = binomial )
glm.mod.new

```
##
## Call: glm(formula = Direction ~ ., family = binomial, data = Smarket.train)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4
## 0.191213 -0.054178 -0.045805 0.007200 0.006441
## Lag5 Volume
## -0.004223 -0.116257
##
## Degrees of Freedom: 997 Total (i.e. Null); 991 Residual
## Null Deviance: 1383
## Residual Deviance: 1381 AIC: 1395
```

4. Predict
glm.probs.new = predict ( glm.mod.new , newdata = Smarket.test , type = "response" )
head ( glm.probs.new )

```
## 999 1000 1001 1002 1003 1004
## 0.5282195 0.5156688 0.5226521 0.5138543 0.4983345 0.5010912
```

glm.pred.new = sapply ( glm.probs.new , function ( p ) { ifelse ( p > .5 , "Up" , "Down" ) })
head ( glm.pred.new )

```
## 999 1000 1001 1002 1003 1004
## "Up" "Up" "Up" "Up" "Down" "Up"
```

glm.cm.new = table ( data.frame ( response = glm.pred.new , truth = Smarket.test $ Direction ))
glm.cm.new

```
## truth
## response Down Up
## Down 77 97
## Up 34 44
```

glm.mmse.new = 1 - sum ( diag ( glm.cm.new )) / sum ( glm.cm.new )
glm.mmse.new

`## [1] 0.5198413`

MLR
1. Task
mlr.glm.task.new = makeClassifTask ( id = "SmarketNew" , data = Smarket.train , positive = "Up" , target = "Direction" )
mlr.glm.task.new

```
## Supervised task: SmarketNew
## Type: classif
## Target: Direction
## Observations: 998
## Features:
## numerics factors ordered
## 6 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Classes: 2
## Down Up
## 491 507
## Positive class: Up
```

2. Learner
mlr.glm.lrn.new = mlr.glm.lrn # learner can be reused

3. Train
mlr.glm.mod.new = train ( mlr.glm.lrn.new , mlr.glm.task.new )
mlr.glm.mod.new

```
## Model for id = classif.binomial class = classif.binomial
## Trained on obs: 998
## Used features: 6
## Hyperparameters:
```

4. Predict
## prediction - model + either task or newdata but not both
mlr.glm.pred.new = predict ( object = mlr.glm.mod.new , newdata = Smarket.test )
mlr.glm.pred.new

```
## Prediction: 252 observations
## predict.type: prob
## threshold: Down=0.50,Up=0.50
## time: 0.00
## truth prob.Down prob.Up response
## 999 Down 0.4717805 0.5282195 Up
## 1000 Down 0.4843312 0.5156688 Up
## 1001 Down 0.4773479 0.5226521 Up
## 1002 Up 0.4861457 0.5138543 Up
## 1003 Down 0.5016655 0.4983345 Down
## 1004 Up 0.4989088 0.5010912 Up
```

mlr.glm.mmse.new = performance ( mlr.glm.pred.new )
mlr.glm.mmse.new

```
## mmce
## 0.5198413
```

Above is a quick introcution to mlr . I consider the real benefits of using this package would be demonstrated better in Resampling and Benchmark steps as well as its native support of parallel computing. These will be topics of subsequent articles.