Introduction to AutoML using lares

The lares package has multiple families of functions that help analysts and data scientists produce robust, high-quality analyses without much coding. One of the most complex but valuable functions is h2o_automl, which semi-automatically runs the whole pipeline of a Machine Learning model given a dataset and some customizable parameters. AutoML lets you train high-quality models tailored to your needs and speeds up the research and development process.

HELP: Before getting to the code, I recommend checking h2o_automl’s full documentation online or within your R session by running ?lares::h2o_automl. There you’ll find a brief description of every parameter you can set to get exactly what you need and control how the function behaves.

Pipeline

In short, these are some of the things that happen on its backend:

Mapping h2o_automl

  1. Input a dataframe df and choose which one is the dependent variable (y) you’d like to predict (the console output labels it “INDEPENDENT VARIABLE”). You may set/change the seed argument to guarantee reproducibility of your results.

  2. The function decides whether it’s a classification (categorical) or regression (continuous) model by looking at the dependent variable’s (y) class and number of unique values, which can be controlled with the thresh parameter.

  3. The dataframe will be split in two: train and test datasets. The proportion of this split can be controlled with the split argument. This can be replicated with the msplit() function.

  4. You can also center and scale your numerical values before you continue, use the no_outliers argument to exclude some outliers, and/or impute missing values with MICE. If it’s a classification model, the function can balance (under-sample) your training data; you can control this behavior with the balance argument. Up to this point, you can replicate the whole process with the model_preprocess() function.

  5. The function runs h2o::h2o.automl(...) to train multiple models and generate a leaderboard with the top models trained (limited by max_models or max_time), sorted by their performance. You can also customize additional arguments such as nfolds for k-fold cross-validation, exclude_algos and include_algos to exclude or include some algorithms, and any other argument you wish to pass to the underlying function.

  6. The best model according to the default performance metric (which can be changed with the stopping_metric parameter), evaluated with cross-validation (customize it with nfolds), will be selected to continue. You can also use the h2o_selectmodel() function to select another model and recalculate/plot everything again using this alternate model.

  7. Performance metrics and plots will be calculated and rendered given the test predictions and test actual values (which were NOT passed to the models as inputs to be trained with). That way, your model’s performance metrics shouldn’t be biased. You can replicate these calculations with the model_metrics() function.

  8. The function returns a list with all the inputs, leaderboard results, best selected model, performance metrics, and plots. You can either print the results on the console or export them using the export_results() function.
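
To make steps 2 to 4 of the list above more concrete, here is a minimal base R sketch of the three ideas: guessing the model type from the target, splitting the data reproducibly, and under-sampling to balance classes. It is not lares’ actual implementation. The helper names guess_model_type(), split_data(), and balance_classes(), the toy data frame, and the default thresh of 10 are all assumptions made for illustration.

```r
# Hypothetical toy data; column names are illustrative only.
set.seed(0)
toy <- data.frame(
  survived = sample(c(TRUE, FALSE), 100, replace = TRUE),
  fare = round(runif(100, 5, 250), 2),
  age = sample(1:80, 100, replace = TRUE)
)

# Step 2 (assumed rule): non-numeric targets, or numeric targets with few
# unique values, are treated as classification; everything else as regression.
guess_model_type <- function(y, thresh = 10) {
  if (!is.numeric(y) || length(unique(y)) <= thresh) {
    "Classification"
  } else {
    "Regression"
  }
}

# Step 3: random train/test split, reproducible through the seed.
split_data <- function(df, size = 0.7, seed = 0) {
  set.seed(seed)
  idx <- sample(seq_len(nrow(df)), size = floor(size * nrow(df)))
  list(train = df[idx, ], test = df[-idx, ])
}

# Step 4 (optional): under-sample so every class keeps as many rows as the
# rarest class has.
balance_classes <- function(df, y, seed = 0) {
  set.seed(seed)
  n_min <- min(table(df[[y]]))
  groups <- split(seq_len(nrow(df)), df[[y]])
  idx <- unlist(lapply(groups, function(i) i[sample.int(length(i), n_min)]),
                use.names = FALSE)
  df[idx, ]
}

type_survived <- guess_model_type(toy$survived)
type_fare <- guess_model_type(toy$fare)
sets <- split_data(toy, size = 0.7, seed = 0)
balanced_train <- balance_classes(sets$train, "survived")
```

In practice you would rely on msplit() or model_preprocess() rather than hand-written helpers; the sketch only shows what those steps amount to.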

Load the library

Now, let’s (install and) load the library, the data, and dig in:

# install.packages("lares")
library(lares)

# The data we'll use is the Titanic dataset
data(dft)
df <- subset(dft, select = -c(Ticket, PassengerId, Cabin))

NOTE: I’ll arbitrarily set some parameters in each example to showcase some of the arguments you can pass to your models. Be sure to also check all the printed output, warnings, and messages shown throughout the process, as they may have relevant information regarding your inputs and the backend operations.

Modeling examples

Let’s have a look at three specific examples: two classification models (binary and multi-categorical) and a regression model. Also, let’s see how we can export our models and put them to work in any environment.

Classification: Binary

Let’s begin with a binary (TRUE/FALSE) model to predict if each passenger Survived:

r <- h2o_automl(df, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
#> 2021-09-10 08:17:40 | Started process...
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is too old (3 months and 21 days)!
#> Please download and install the latest version from http://h2o.ai/download/
#> - INDEPENDENT VARIABLE: Survived
#> - MODEL TYPE: Classification
#> # A tibble: 2 × 5
#>   tag       n     p order  pcum
#>   <lgl> <int> <dbl> <int> <dbl>
#> 1 FALSE   549  61.6     1  61.6
#> 2 TRUE    342  38.4     2 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 & test = 0.3
#> train_size  test_size 
#>        623        268
#> - REPEATED: There were 62 repeated rows which are being suppressed from the train dataset
#> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
#> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
#> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
#> >>> Iterating until 1 models or 600 seconds...
#> - EUREKA: Succesfully generated 1 models
#>                           model_id       auc   logloss    aucpr
#> 1 XGBoost_1_AutoML_20210910_081757 0.8634503 0.4341962 0.842081
#>   mean_per_class_error      rmse       mse
#> 1            0.1867555 0.3680467 0.1354583
#> SELECTED MODEL: XGBoost_1_AutoML_20210910_081757
#> - NOTE: The following variables were the least important: SibSp, Pclass.2
#> >>> Running predictions for Survived...
#> Target value: TRUE
#> >>> Generating plots...
#> Model (1/1): XGBoost_1_AutoML_20210910_081757
#> Independent Variable: Survived
#> Type: Classification (2 classes)
#> Algorithm: XGBOOST
#> Split: 70% training data (of 891 observations)
#> Seed: 0
#> 
#> Test metrics:
#>    AUC = 0.83534
#>    ACC = 0.20522
#>    PRC = 0.16071
#>    TPR = 0.27273
#>    TNR = 0.16568
#> 
#> Most important variables:
#>    Sex.female (31.9%)
#>    Fare (21.2%)
#>    Age (20.0%)
#>    Pclass.3 (7.3%)
#>    Sex.male (5.0%)
#> Process duration: 25.9s

Let’s take a look at the plots generated into a single dashboard:

plot(r)

We also have several calculations for our model’s performance that may come in handy, such as a confusion matrix, gain and lift by percentile, area under the curve (AUC), accuracy (ACC), recall or true positive rate (TPR), cross-validation metrics, exact thresholds to maximize each metric, and others:

r$metrics
#> $dictionary
#> [1] "AUC: Area Under the Curve"                                                             
#> [2] "ACC: Accuracy"                                                                         
#> [3] "PRC: Precision = Positive Predictive Value"                                            
#> [4] "TPR: Sensitivity = Recall = Hit rate = True Positive Rate"                             
#> [5] "TNR: Specificity = Selectivity = True Negative Rate"                                   
#> [6] "Logloss (Error): Logarithmic loss [Neutral classification: 0.69315]"                   
#> [7] "Gain: When best n deciles selected, what % of the real target observations are picked?"
#> [8] "Lift: When best n deciles selected, how much better than random is?"                   
#> 
#> $confusion_matrix
#>        Pred
#> Real    FALSE TRUE
#>   FALSE    28  141
#>   TRUE     72   27
#> 
#> $gain_lift
#> # A tibble: 10 × 10
#>    percentile value random target total  gain optimal  lift response score
#>    <fct>      <chr>  <dbl>  <int> <int> <dbl>   <dbl> <dbl>    <dbl> <dbl>
#>  1 1          TRUE    10.1     26    27  26.3    27.3 161.     26.3  93.9 
#>  2 2          TRUE    20.1     20    27  46.5    54.5 131.     20.2  81.1 
#>  3 3          TRUE    30.2     15    27  61.6    81.8 104.     15.2  62.0 
#>  4 4          TRUE    39.9     13    26  74.7   100    87.2    13.1  43.8 
#>  5 5          TRUE    50        4    27  78.8   100    57.6     4.04 22.9 
#>  6 6          TRUE    60.1      8    27  86.9   100    44.6     8.08 14.2 
#>  7 7          TRUE    69.8      4    26  90.9   100    30.3     4.04 10.8 
#>  8 8          TRUE    79.9      5    27  96.0   100    20.2     5.05  7.86
#>  9 9          TRUE    89.9      3    27  99.0   100    10.1     3.03  5.37
#> 10 10         TRUE   100        1    27 100     100     0       1.01  2.57
#> 
#> $metrics
#>       AUC     ACC     PRC     TPR     TNR
#> 1 0.83534 0.20522 0.16071 0.27273 0.16568
#> 
#> $cv_metrics
#> # A tibble: 20 × 8
#>    metric     mean     sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
#>    <chr>     <dbl>  <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1 accuracy  0.835 0.0667      0.888      0.8        0.864      0.734      0.887
#>  2 auc       0.866 0.0387      0.908      0.836      0.896      0.818      0.874
#>  3 err       0.165 0.0667      0.112      0.2        0.136      0.266      0.113
#>  4 err_cou… 20.6   8.26       14         25         17         33         14    
#>  5 f0point5  0.789 0.0672      0.841      0.739      0.810      0.698      0.854
#>  6 f1        0.801 0.0480      0.841      0.752      0.805      0.752      0.854
#>  7 f2        0.815 0.0348      0.841      0.766      0.799      0.814      0.854
#>  8 lift_to…  2.59  0.288       2.84       2.55       2.84       2.14       2.58 
#>  9 logloss   0.434 0.0819      0.343      0.500      0.377      0.537      0.416
#> 10 max_per…  0.223 0.0931      0.159      0.224      0.205      0.379      0.146
#> 11 mcc       0.659 0.117       0.754      0.586      0.700      0.493      0.762
#> 12 mean_pe…  0.829 0.0595      0.877      0.796      0.848      0.742      0.881
#> 13 mean_pe…  0.171 0.0595      0.123      0.204      0.152      0.258      0.119
#> 14 mse       0.135 0.0308      0.104      0.164      0.114      0.172      0.123
#> 15 pr_auc    0.849 0.0415      0.899      0.787      0.865      0.835      0.859
#> 16 precisi…  0.781 0.0800      0.841      0.731      0.814      0.667      0.854
#> 17 r2        0.429 0.111       0.546      0.312      0.499      0.308      0.480
#> 18 recall    0.826 0.0381      0.841      0.776      0.795      0.862      0.854
#> 19 rmse      0.366 0.0415      0.322      0.405      0.338      0.415      0.351
#> 20 specifi…  0.832 0.124       0.914      0.816      0.901      0.621      0.908
#> 
#> $max_metrics
#>                         metric  threshold       value idx
#> 1                       max f1 0.34555702   0.7725490 200
#> 2                       max f2 0.17723659   0.8146487 273
#> 3                 max f0point5 0.72468458   0.8016148 112
#> 4                 max accuracy 0.63740522   0.8170144 135
#> 5                max precision 0.98975676   1.0000000   0
#> 6                   max recall 0.02737888   1.0000000 392
#> 7              max specificity 0.98975676   1.0000000   0
#> 8             max absolute_mcc 0.34555702   0.6174870 200
#> 9   max min_per_class_accuracy 0.34555702   0.8106996 200
#> 10 max mean_per_class_accuracy 0.34555702   0.8132445 200
#> 11                     max tns 0.98975676 380.0000000   0
#> 12                     max fns 0.98975676 242.0000000   0
#> 13                     max fps 0.01139881 380.0000000 399
#> 14                     max tps 0.02737888 243.0000000 392
#> 15                     max tnr 0.98975676   1.0000000   0
#> 16                     max fnr 0.98975676   0.9958848   0
#> 17                     max fpr 0.01139881   1.0000000 399
#> 18                     max tpr 0.02737888   1.0000000 392
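
The threshold-based test metrics follow directly from the confusion matrix. As a quick check, the base R sketch below rebuilds ACC, PRC, TPR, and TNR from the four counts printed in $confusion_matrix; the variable names are illustrative and the formulas are the standard definitions listed in $dictionary.

```r
# Confusion matrix copied from r$metrics$confusion_matrix above
# (rows = real values, columns = predicted values).
cm <- matrix(
  c(28, 141,
    72, 27),
  nrow = 2, byrow = TRUE,
  dimnames = list(Real = c("FALSE", "TRUE"), Pred = c("FALSE", "TRUE"))
)

tn <- cm["FALSE", "FALSE"]
fp <- cm["FALSE", "TRUE"]
fn <- cm["TRUE", "FALSE"]
tp <- cm["TRUE", "TRUE"]

acc <- (tp + tn) / sum(cm) # accuracy
prc <- tp / (tp + fp)      # precision
tpr <- tp / (tp + fn)      # recall / sensitivity
tnr <- tn / (tn + fp)      # specificity

round(c(ACC = acc, PRC = prc, TPR = tpr, TNR = tnr), 5)
```

These reproduce the values under $metrics. Note that an accuracy this far below 0.5 next to an AUC of 0.835 is unusual; it may mean the predicted labels ended up inverted relative to the scores in this particular run, so it is worth looking at the confusion matrix before relying on threshold-based metrics.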

The same goes for the plots generated for these metrics. We have the gains and response plots on the test dataset, the confusion matrix, and ROC curves.

r$plots$metrics
#> $gains

#> 
#> $response

#> 
#> $conf_matrix

#> 
#> $ROC
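
The gains plot is drawn from the gain_lift table printed by r$metrics. As a rough check, the sketch below recomputes the random, gain, and lift columns from the target and total counts per percentile. The formulas are inferred from the printed table rather than taken from the lares source, so treat them as an assumption.

```r
# Values copied from r$metrics$gain_lift above (one entry per percentile).
target <- c(26, 20, 15, 13, 4, 8, 4, 5, 3, 1)       # real TRUE cases per bucket
total  <- c(27, 27, 27, 26, 27, 27, 26, 27, 27, 27) # observations per bucket

random <- 100 * cumsum(total) / sum(total)   # share of observations selected so far
gain   <- 100 * cumsum(target) / sum(target) # share of TRUE cases captured so far
lift   <- 100 * (gain / random - 1)          # % better than a random selection

round(data.frame(percentile = 1:10, random, gain, lift), 1)
```

Read this way, selecting the top-scored 10% of passengers captures about 26% of the real TRUE cases, which is roughly 161% better than picking passengers at random.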

For all models, regardless of their type (classification or regression), you can check the importance of each variable as well:

head(r$importance)
#>     variable relative_importance scaled_importance importance
#> 1 Sex.female           218.45708         1.0000000 0.31875098
#> 2       Fare           145.14513         0.6644103 0.21178142
#> 3        Age           137.14154         0.6277734 0.20010338
#> 4   Pclass.3            49.86285         0.2282501 0.07275494
#> 5   Sex.male            34.36985         0.1573300 0.05014909
#> 6   Pclass.1            26.77258         0.1225530 0.03906390

r$plots$importance
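
To see how the columns of the importance table relate to each other, the sketch below recomputes them from the six rows printed above. It assumes that scaled_importance is each value divided by the largest one, and that importance is each variable’s share of the total over all variables (not only the six shown); both are inferred from the printed numbers.

```r
# Values copied from head(r$importance) above.
imp <- data.frame(
  variable = c("Sex.female", "Fare", "Age", "Pclass.3", "Sex.male", "Pclass.1"),
  relative_importance = c(218.45708, 145.14513, 137.14154,
                          49.86285, 34.36985, 26.77258),
  importance = c(0.31875098, 0.21178142, 0.20010338,
                 0.07275494, 0.05014909, 0.03906390)
)

# scaled_importance: each variable relative to the most important one.
imp$scaled <- imp$relative_importance / max(imp$relative_importance)

# If importance is a share of the total, this ratio should be nearly constant.
imp$implied_total <- imp$relative_importance / imp$importance

imp
```

The percentages listed under "Most important variables" in the console output are these importance values multiplied by 100.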

Classification: Multi-Categorical

Now, let’s run a multi-categorical (more than two labels) model to predict the Pclass of each passenger:

r <- h2o_automl(df, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
#> 2021-09-10 08:18:10 | Started process...
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is too old (3 months and 21 days)!
#> Please download and install the latest version from http://h2o.ai/download/
#> - INDEPENDENT VARIABLE: Pclass
#> - MODEL TYPE: Classification
#> # A tibble: 3 × 5
#>   tag       n     p order  pcum
#>   <fct> <int> <dbl> <int> <dbl>
#> 1 n_3     491  55.1     1  55.1
#> 2 n_1     216  24.2     2  79.4
#> 3 n_2     184  20.6     3 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 & test = 0.3
#> train_size  test_size 
#>        623        268
#> - REPEATED: There were 65 repeated rows which are being suppressed from the train dataset
#> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
#> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
#> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
#> >>> Iterating until 3 models or 30 seconds...
#> - EUREKA: Succesfully generated 3 models
#>                           model_id mean_per_class_error   logloss      rmse
#> 1 XGBoost_2_AutoML_20210910_081810            0.4764622 0.8245831 0.5384408
#> 2 XGBoost_1_AutoML_20210910_081810            0.4861030 0.8451478 0.5422224
#> 3 XGBoost_3_AutoML_20210910_081810            0.4904451 0.8522329 0.5440560
#>         mse auc aucpr
#> 1 0.2899185 NaN   NaN
#> 2 0.2940051 NaN   NaN
#> 3 0.2959969 NaN   NaN
#> SELECTED MODEL: XGBoost_2_AutoML_20210910_081810
#> - NOTE: The following variables were the least important: Sex.male, Embarked.Q
#> >>> Running predictions for Pclass...
#> Model (1/3): XGBoost_2_AutoML_20210910_081810
#> Independent Variable: Pclass
#> Type: Classification (3 classes)
#> Algorithm: XGBOOST
#> Split: 70% training data (of 891 observations)
#> Seed: 0
#> 
#> Test metrics:
#>    AUC = 0.78078
#>    ACC = 0.68284
#> 
#> Most important variables:
#>    Age (52.0%)
#>    Survived.FALSE (11.9%)
#>    Embarked.C (7.2%)
#>    Survived.TRUE (5.4%)
#>    Sex.female (5.4%)
#> Process duration: 11.4s

Let’s take a look at the plots generated into a single dashboard:

plot(r)
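
Since this run produced a leaderboard with three models, it is a natural place to try h2o_selectmodel(), mentioned in the pipeline description. The sketch below switches to the second model in the leaderboard. The argument name which_model is an assumption to confirm with ?lares::h2o_selectmodel, and the call only runs if lares is installed and the r object from the example above exists (with the H2O cluster still running).

```r
# Position in the leaderboard of the alternate model we want to inspect.
alt_rank <- 2

# Guarded so the snippet does nothing when lares or `r` is not available.
if (requireNamespace("lares", quietly = TRUE) && exists("r")) {
  # `which_model` is assumed to be the argument name; check the function's help.
  r_alt <- lares::h2o_selectmodel(r, which_model = alt_rank)
  print(r_alt)
}
```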

Regression

Finally, a regression model with continuous values to predict the Fare paid by each passenger:

r <- h2o_automl(df, y = "Fare", ignore = "Pclass", exclude_algos = NULL, quiet = TRUE)
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is too old (3 months and 21 days)!
#> Please download and install the latest version from http://h2o.ai/download/
print(r)
#> Model (1/4): StackedEnsemble_AllModels_AutoML_20210910_081823
#> Independent Variable: Fare
#> Type: Regression
#> Algorithm: STACKEDENSEMBLE
#> Split: 70% training data (of 871 observations)
#> Seed: 0
#> 
#> Test metrics:
#>    rmse = 20.309
#>    mae = 14.244
#>    mape = 0.07304
#>    mse = 412.45
#>    rsq = 0.3169
#>    rsqa = 0.3143
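
For reference, here is a small base R sketch of how error metrics like these are commonly defined, using hypothetical actual and predicted values. Whether lares uses exactly these definitions (particularly for rsq and mape) is not confirmed here, so treat the formulas as the usual textbook versions.

```r
# Hypothetical actual and predicted values, for illustration only.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

err  <- actual - predicted
mse  <- mean(err^2)             # mean squared error
rmse <- sqrt(mse)               # root mean squared error
mae  <- mean(abs(err))          # mean absolute error
mape <- mean(abs(err / actual)) # mean absolute percentage error (as a ratio)
# One common definition of R squared: 1 - SSE / SST
rsq  <- 1 - sum(err^2) / sum((actual - mean(actual))^2)

round(c(rmse = rmse, mae = mae, mape = mape, mse = mse, rsq = rsq), 4)
```

The same relationship between rmse and mse holds in the output above: 20.309 squared is roughly 412.45.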

Let’s take a look at the plots generated into a single dashboard:

plot(r)

Export models and results

Once you have your model trained and picked, you can export the model and its results, so you can put it to work in a production environment (which doesn’t have to be R). There is a function that does all that for you: export_results(). Simply pass your h2o_automl list object into this function and that’s it! You can select which formats will be exported using the which argument. Currently we support: txt, csv, rds, binary, mojo [best format for production], and plots. There are also 2 quick options (dev and production) to export some or all the files. Lastly, you can set a custom subdir to gather everything into a new sub-directory; I’d recommend using the model’s name or any other convention that helps you know which one’s which.
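
A minimal sketch of such an export call could look like the following. The sub-directory name is hypothetical, and the call only runs if lares is installed and an h2o_automl result named r exists (with the H2O cluster still running); check ?lares::export_results for the full list of arguments.

```r
# Formats to export; all of them are listed as supported in the text above.
formats <- c("txt", "csv", "mojo")
# Hypothetical sub-directory name used to keep this model's files together.
out_dir <- "titanic_survived_xgboost"

# Guarded so the snippet does nothing when lares or `r` is not available.
if (requireNamespace("lares", quietly = TRUE) && exists("r")) {
  lares::export_results(r, which = formats, subdir = out_dir)
}
```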

Import and use your models

If you’d like to re-use your exported models to predict new datasets, you have several options:

Additional Posts