1 Introduction

The enpls package offers an algorithmic framework for measuring feature importance, detecting outliers, and ensemble modeling based on (sparse) partial least squares regression. The key functions included in the package are listed in the table below.

Task                            Partial Least Squares   Sparse Partial Least Squares
Model fitting                   enpls.fit()             enspls.fit()
Cross validation                cv.enpls()              cv.enspls()
Detect outliers                 enpls.od()              enspls.od()
Measure feature importance      enpls.fs()              enspls.fs()
Evaluate applicability domain   enpls.ad()              enspls.ad()

Next, we will use the data from Wang et al. (2015) to demonstrate the general workflow of enpls. The dataset contains 1,000 compounds, each characterized by 80 molecular descriptors. The response is the octanol/water distribution coefficient at pH 7.4 (logD7.4).

Let’s load the data and take a look at it:

library("enpls")
library("ggplot2")

data("logd1k")
x <- logd1k$x
y <- logd1k$y
head(x)[, 1:5]
##   BalabanJ  BertzCT   Chi0  Chi0n  Chi0v
## 1    1.949  882.760 16.845 13.088 13.088
## 2    1.970  781.936 15.905 13.204 14.021
## 3    2.968  343.203  9.845  7.526  7.526
## 4    2.050 1133.679 19.836 15.406 15.406
## 5    2.719  437.346 12.129  9.487  9.487
## 6    2.031  983.304 19.292 15.289 15.289
head(y)
## [1] -0.96 -0.92 -0.90 -0.83 -0.82 -0.79

2 Model Fitting

Here we fit an ensemble sparse partial least squares model to the data. Because each base model in the ensemble is sparse, its complexity is usually lower than that of a vanilla partial least squares model.

fit <- enspls.fit(x, y, ratio = 0.7, reptimes = 20, maxcomp = 3)
y.pred <- predict(fit, newx = x)
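To make the ensemble idea behind the call above more concrete, here is a minimal base R sketch. It rests on two assumptions that this section does not spell out: that ratio is the fraction of samples drawn for each base model, and that reptimes is the number of base models whose predictions are averaged. The base learner is swapped for lm() purely for illustration, and the names mc_ensemble_fit(), mc_ensemble_predict(), x_toy, and y_toy are hypothetical, not part of enpls.

```r
# Illustrative sketch only: a subsampling ensemble with lm() as a stand-in
# base learner. enpls fits (sparse) PLS models instead; this is not its code.
set.seed(42)

# Hypothetical toy data: 100 samples, 3 predictors, linear signal plus noise
n <- 100
x_toy <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, c("a", "b", "c")))
y_toy <- as.numeric(x_toy %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1))

# Fit `reptimes` models, each on a random subsample holding a `ratio`
# fraction of the rows (assumed meaning of these two arguments)
mc_ensemble_fit <- function(x, y, ratio = 0.7, reptimes = 20) {
  n_sub <- floor(nrow(x) * ratio)
  lapply(seq_len(reptimes), function(i) {
    idx <- sample(nrow(x), n_sub)  # subsample without replacement
    lm(y ~ ., data = data.frame(y = y[idx], x[idx, , drop = FALSE]))
  })
}

# Average the predictions of all base models
mc_ensemble_predict <- function(models, newx) {
  preds <- sapply(models, predict, newdata = data.frame(newx))
  rowMeans(preds)
}

models <- mc_ensemble_fit(x_toy, y_toy, ratio = 0.7, reptimes = 20)
y_toy_pred <- mc_ensemble_predict(models, x_toy)
```

Averaging over models trained on different subsamples tends to stabilize the prediction, and the spread of the individual predictions can serve as a rough indication of how much the models disagree on a given sample.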

# Plot predicted against observed responses;
# the grey diagonal marks perfect agreement
df <- data.frame(y, y.pred)
ggplot(df, aes(x = y, y = y.pred)) +
  geom_abline(slope = 1, intercept = 0, colour = "darkgrey") +
  geom_point(size = 3, shape = 1, alpha = 0.8) +
  coord_fixed(ratio = 1) +
  xlab("Observed Response") +
  ylab("Predicted Response")
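The scatter plot can be complemented with numeric summaries of the fit. The sketch below defines two small helpers, rmse() and r_squared(), which are hypothetical names written for this example and not functions from enpls. It assumes the predictions are available as a plain numeric vector of the same length as the observed response. Since the predictions above were made on the training data, such numbers are likely optimistic compared with performance on unseen compounds.

```r
# Hypothetical helpers (not part of enpls) for summarizing prediction quality

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}

# Coefficient of determination: 1 - residual SS / total SS
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Small made-up example to show the calls
obs <- c(1, 2, 3, 4)
pred <- c(1.1, 1.9, 3.2, 3.8)
toy_rmse <- rmse(obs, pred)
toy_r2 <- r_squared(obs, pred)

# With the objects from this section, the calls would be:
# rmse(y, y.pred)
# r_squared(y, y.pred)
```

For an estimate that is less tied to the training data, the cross validation functions listed in the introduction are the more appropriate tool.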