# Regression

library("lessR")

The Regression() function performs multiple facets of a complete regression analysis. Abbreviate with reg().

To illustrate, first read the Employee data included as part of lessR. Read into the default lessR data frame d.

d <- Read("Employee")
##
## >>> Suggestions
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##     Variable                  Missing  Unique
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

## Default Analysis

### Brief output

The brief version provides just the basic analysis, comparable to what Excel provides, plus a scatterplot with the regression line. With multiple regression, the scatterplot becomes a scatterplot matrix.

reg_brief(Salary ~ Years + Pre)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  36
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825
##
##
## Standard deviation of residuals:  11753.478 for 33 degrees of freedom
##
## R-squared:  0.726    Adjusted R-squared:  0.710    PRESS R-squared:  0.659
##
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000
##
##
##             df           Sum Sq          Mean Sq   F-value   p-value
##     Years    1  12107157290.292  12107157290.292    87.641     0.000
##       Pre    1      1639658.444      1639658.444     0.012     0.914
##
## Model        2  12108796948.736   6054398474.368    43.827     0.000
## Residuals   33   4558759843.773    138144237.690
## Salary      35  16667556792.508    476215908.357
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##   RESIDUALS AND INFLUENCE
##
##   FORECASTING ERROR

### Full output

The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The idea is to provide all of the information you need for a proper regression analysis.

reg(Salary ~ Years + Pre)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  36
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825
##
##
## Standard deviation of residuals:  11753.478 for 33 degrees of freedom
##
## R-squared:  0.726    Adjusted R-squared:  0.710    PRESS R-squared:  0.659
##
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000
##
##
##             df           Sum Sq          Mean Sq   F-value   p-value
##     Years    1  12107157290.292  12107157290.292    87.641     0.000
##       Pre    1      1639658.444      1639658.444     0.012     0.914
##
## Model        2  12108796948.736   6054398474.368    43.827     0.000
## Residuals   33   4558759843.773    138144237.690
## Salary      35  16667556792.508    476215908.357
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##          Salary Years  Pre
##   Salary   1.00  0.85 0.03
##    Years   0.85  1.00 0.05
##      Pre   0.03  0.05 1.00
##
##
##         Tolerance       VIF
##   Years     0.998     1.002
##     Pre     0.998     1.002
##
##
##      1   0    0.718      1
##      1   1    0.710      2
##      0   1   -0.028      1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
##   RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20, out of 36 rows of data, or do res_rows="all"]
## -----------------------------------------------------------------------------------------
##                        Years     Pre     Salary     fitted      resid rstdnt dffits cooks
##       Correll, Trevon     21      97 134419.230 110648.843  23770.387  2.424  1.217 0.430
##         James, Leslie     18      70 122563.380 101387.773  21175.607  1.998  0.714 0.156
##         Capelle, Adam     24      83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132
##           Hoang, Binh     15      96 111074.860  91158.659  19916.201  1.860  0.649 0.131
##    Korhalkar, Jessica      2      74  72502.500  49292.181  23210.319  2.171  0.638 0.122
##        Billing, Susan      4      91  72675.260  55484.493  17190.767  1.561  0.472 0.071
##          Singh, Niral      2      59  61055.440  49566.155  11489.285  1.064  0.452 0.068
##        Skrotzki, Sara     18      63  91352.330 101515.627 -10163.297 -0.937 -0.397 0.053
##      Saechao, Suzanne      8      98  55545.250  68362.271 -12817.021 -1.157 -0.390 0.050
##         Kralik, Laura     10      74  92681.190  75303.447  17377.743  1.535  0.287 0.026
##   Anastasiou, Crystal      2      59  56508.320  49566.155   6942.165  0.636  0.270 0.025
##     Langston, Matthew      5      94  49188.960  58681.106  -9492.146 -0.844 -0.268 0.024
##        Afshari, Anbar      6     100  69441.930  61822.925   7619.005  0.689  0.264 0.024
##   Cassinelli, Anastis     10      80  57562.360  75193.857 -17631.497 -1.554 -0.265 0.022
##      Osterman, Pascal      5      69  49704.790  59137.730  -9432.940 -0.826 -0.216 0.016
##   Bellingar, Samantha     10      67  66337.830  75431.301  -9093.471 -0.793 -0.198 0.013
##          LaRoe, Maria     10      80  61961.290  75193.857 -13232.567 -1.148 -0.195 0.013
##      Ritchie, Darnell      7      82  53788.260  65403.102 -11614.842 -1.006 -0.190 0.012
##        Sheppard, Cory     14      66  95027.550  88455.199   6572.351  0.579  0.176 0.011
##        Downs, Deborah      7      90  57139.900  65256.982  -8117.082 -0.706 -0.174 0.010
##
##
##   FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast,
## 95% Prediction Intervals
##    [sorted by lower bound of prediction interval]
##    [to see all intervals do pred_rows="all"]
##  ----------------------------------------------
##
##                        Years    Pre     Salary       pred        sf    pi.lwr     pi.upr     width
##          Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352
##          Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281
##   Anastasiou, Crystal      2     59  56508.320  49566.155 12619.291 23892.014  75240.296 51348.281
## ...
##          Link, Thomas     10     83  66312.890  75139.062 11933.518 50860.137  99417.987 48557.849
##          LaRoe, Maria     10     80  61961.290  75193.857 11918.048 50946.405  99441.308 48494.903
##   Cassinelli, Anastis     10     80  57562.360  75193.857 11918.048 50946.405  99441.308 48494.903
## ...
##       Correll, Trevon     21     97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747
##         Capelle, Adam     24     83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------
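
The fit indices in the output follow directly from the ANOVA table. As a minimal sketch, the arithmetic below copies the sums of squares and degrees of freedom from the table above and recomputes the fit statistics with base R. The variable names are illustrative only and are not part of lessR.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_model <- 12108796948.736
ss_resid <- 4558759843.773
ss_total <- 16667556792.508
df_model <- 2
df_resid <- 33
df_total <- 35

# R-squared: proportional reduction in squared error relative to the total
rsq <- 1 - ss_resid / ss_total

# Adjusted R-squared: the same comparison based on mean squares
rsq_adj <- 1 - (ss_resid / df_resid) / (ss_total / df_total)

# F-statistic: model mean square divided by residual mean square
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

# Standard deviation of residuals: square root of the residual mean square
se <- sqrt(ss_resid / df_resid)

round(c(rsq = rsq, rsq_adj = rsq_adj, f_stat = f_stat, se = se), 3)
```

The results should agree, within rounding, with the R-squared, adjusted R-squared, F-statistic and standard deviation of residuals reported in the output.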

Brief output with standardization of all variables in the model. Plot the residuals as a line connecting each data point to the corresponding point on the regression line.

reg_brief(Salary ~ Years, rescale="z", plot_errors=TRUE)
##
## Rescaled Data, First Six Rows
##                     Salary  Years
## Hamide, Bita        -1.044 -1.466
## Singh, Niral        -0.584 -1.291
## Korhalkar, Jessica  -0.059 -1.291
## Anastasiou, Crystal -0.793 -1.291
## Gvakharia, Kimberly -1.098 -1.116
## Stanley, Emma       -1.269 -1.116

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years, rescale="z", plot_errors=TRUE, Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
## Predictor Variable: Years
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  36
##
## Data are Standardized
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept)    -0.026      0.089   -0.299    0.767      -0.206       0.154
##       Years     0.853      0.090    9.498    0.000       0.670       1.035
##
##
## Standard deviation of residuals:  0.531 for 34 degrees of freedom
##
## R-squared:  0.726    Adjusted R-squared:  0.718    PRESS R-squared:  0.681
##
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 90.217     df: 1 and 34     p-value:  0.000
##
##
##             df    Sum Sq   Mean Sq   F-value   p-value
## Model        1    25.472    25.472    90.217     0.000
## Residuals   34     9.600     0.282
## Salary      35    35.072     1.002
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##   RESIDUALS AND INFLUENCE
##
##   FORECASTING ERROR
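
To see what standardization does to a one-predictor model, the following base R sketch rescales both variables to z-scores by hand. It uses the built-in mtcars data only so that it runs without lessR. It is an illustration of the idea, not the code that rescale="z" runs internally.

```r
# Built-in mtcars data, used only so the sketch runs without lessR
y <- mtcars$mpg
x <- mtcars$wt

# z-scores: subtract the mean, divide by the standard deviation
zy <- (y - mean(y)) / sd(y)
zx <- (x - mean(x)) / sd(x)

fit_z <- lm(zy ~ zx)

slope_z <- unname(coef(fit_z)["zx"])
intercept_z <- unname(coef(fit_z)["(Intercept)"])

# With one predictor and complete data, the standardized slope equals
# the correlation and the intercept is zero
c(slope = slope_z, correlation = cor(x, y), intercept = intercept_z)
```

In the Employee output above, the intercept is close to but not exactly zero. One plausible reason, an inference rather than something stated here, is that the row with the missing value of Years is dropped from the analysis after the data are rescaled.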

### k-fold cross-validation

Specify a cross-validation with the kfold parameter. Here specify three folds. The function automatically creates the training and testing data sets.

reg(Salary ~ Years, kfold=3)
## [1] 0.5045434 0.5244471 0.7982615
## [1] 0 0 0
## [1] 0 0 0
## digits_d: 3
##   K-FOLD CROSS-VALIDATION
##
##        Model from Training Data              Applied to Testing Data
##        ----------------------------------   ----------------------------------
## fold    n        se           MSE    Rsq     n        se           MSE    Rsq
##   1 |  24 12244.414 149925677.402  0.742 |  12 13554.852 183734008.419  0.505
##   2 |  24  8753.296  76620183.359  0.813 |  12 17862.434 319066537.185  0.524
##   3 |  24 12701.355 161324429.444  0.679 |  12 10335.938 106831611.281  0.798
##       ----------------------------------    ----------------------------------
## Mean      11233.022 129290096.735  0.745       13917.741 203210718.962  0.609
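
The logic of the procedure can be sketched by hand in base R. The example below assumes the built-in mtcars data and an arbitrary random seed, and computes the testing $$R^2$$ as one minus the ratio of squared prediction error to the squared deviations about the test mean. lessR may assign folds or compute its testing statistics differently, so treat this as an illustration of the idea only.

```r
set.seed(123)  # arbitrary seed so the fold assignment is reproducible

k <- 3
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of nearly equal size
fold <- sample(rep(1:k, length.out = n))

rsq_test <- numeric(k)
for (i in 1:k) {
  train <- mtcars[fold != i, ]
  test  <- mtcars[fold == i, ]

  # Estimate the model from the training rows only
  fit <- lm(mpg ~ wt, data = train)

  # Apply the training model to the held-out rows
  pred <- predict(fit, newdata = test)
  sse <- sum((test$mpg - pred)^2)
  sst <- sum((test$mpg - mean(test$mpg))^2)
  rsq_test[i] <- 1 - sse / sst
}

rsq_test
mean(rsq_test)
```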

The standard output also includes $$R^2_{press}$$, the value of $$R^2$$ when applied to new, previously unseen data, a value comparable to the average $$R^2$$ on test data.
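
As a hedged sketch of how a PRESS-based $$R^2$$ is commonly computed, the base R code below uses the standard leave-one-out identity: the prediction error for a row left out of the fit equals its ordinary residual divided by one minus its hat value. The example uses the built-in mtcars data. Whether lessR uses exactly this computation is an assumption.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Leave-one-out prediction errors without refitting: e_i / (1 - h_i)
e_loo <- residuals(fit) / (1 - hatvalues(fit))

# PRESS: sum of the squared leave-one-out prediction errors
press <- sum(e_loo^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

rsq <- summary(fit)$r.squared
rsq_press <- 1 - press / sst

c(rsq = rsq, rsq_press = rsq_press)
```

Because each leave-one-out error is at least as large in magnitude as the corresponding ordinary residual, the PRESS version cannot exceed the ordinary $$R^2$$.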

## Output as a Stored Object

The output of Regression() can be stored into an R object, here named r. The output object consists of various components that together define a comprehensive regression analysis. R calls the resulting output structure a list object.

r <- reg(Salary ~ Years + Pre)

Entering the name of the object displays the full output.

r
## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  36
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825
##
##
## Standard deviation of residuals:  11753.478 for 33 degrees of freedom
##
## R-squared:  0.726    Adjusted R-squared:  0.710    PRESS R-squared:  0.659
##
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000
##
##
##             df           Sum Sq          Mean Sq   F-value   p-value
##     Years    1  12107157290.292  12107157290.292    87.641     0.000
##       Pre    1      1639658.444      1639658.444     0.012     0.914
##
## Model        2  12108796948.736   6054398474.368    43.827     0.000
## Residuals   33   4558759843.773    138144237.690
## Salary      35  16667556792.508    476215908.357
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##          Salary Years  Pre
##   Salary   1.00  0.85 0.03
##    Years   0.85  1.00 0.05
##      Pre   0.03  0.05 1.00
##
##
##         Tolerance       VIF
##   Years     0.998     1.002
##     Pre     0.998     1.002
##
##
##      1   0    0.718      1
##      1   1    0.710      2
##      0   1   -0.028      1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
##   RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20, out of 36 rows of data, or do res_rows="all"]
## -----------------------------------------------------------------------------------------
##                        Years     Pre     Salary     fitted      resid rstdnt dffits cooks
##       Correll, Trevon     21      97 134419.230 110648.843  23770.387  2.424  1.217 0.430
##         James, Leslie     18      70 122563.380 101387.773  21175.607  1.998  0.714 0.156
##         Capelle, Adam     24      83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132
##           Hoang, Binh     15      96 111074.860  91158.659  19916.201  1.860  0.649 0.131
##    Korhalkar, Jessica      2      74  72502.500  49292.181  23210.319  2.171  0.638 0.122
##        Billing, Susan      4      91  72675.260  55484.493  17190.767  1.561  0.472 0.071
##          Singh, Niral      2      59  61055.440  49566.155  11489.285  1.064  0.452 0.068
##        Skrotzki, Sara     18      63  91352.330 101515.627 -10163.297 -0.937 -0.397 0.053
##      Saechao, Suzanne      8      98  55545.250  68362.271 -12817.021 -1.157 -0.390 0.050
##         Kralik, Laura     10      74  92681.190  75303.447  17377.743  1.535  0.287 0.026
##   Anastasiou, Crystal      2      59  56508.320  49566.155   6942.165  0.636  0.270 0.025
##     Langston, Matthew      5      94  49188.960  58681.106  -9492.146 -0.844 -0.268 0.024
##        Afshari, Anbar      6     100  69441.930  61822.925   7619.005  0.689  0.264 0.024
##   Cassinelli, Anastis     10      80  57562.360  75193.857 -17631.497 -1.554 -0.265 0.022
##      Osterman, Pascal      5      69  49704.790  59137.730  -9432.940 -0.826 -0.216 0.016
##   Bellingar, Samantha     10      67  66337.830  75431.301  -9093.471 -0.793 -0.198 0.013
##          LaRoe, Maria     10      80  61961.290  75193.857 -13232.567 -1.148 -0.195 0.013
##      Ritchie, Darnell      7      82  53788.260  65403.102 -11614.842 -1.006 -0.190 0.012
##        Sheppard, Cory     14      66  95027.550  88455.199   6572.351  0.579  0.176 0.011
##        Downs, Deborah      7      90  57139.900  65256.982  -8117.082 -0.706 -0.174 0.010
##
##
##   FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast,
## 95% Prediction Intervals
##    [sorted by lower bound of prediction interval]
##    [to see all intervals do pred_rows="all"]
##  ----------------------------------------------
##
##                        Years    Pre     Salary       pred        sf    pi.lwr     pi.upr     width
##          Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352
##          Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281
##   Anastasiou, Crystal      2     59  56508.320  49566.155 12619.291 23892.014  75240.296 51348.281
## ...
##          Link, Thomas     10     83  66312.890  75139.062 11933.518 50860.137  99417.987 48557.849
##          LaRoe, Maria     10     80  61961.290  75193.857 11918.048 50946.405  99441.308 48494.903
##   Cassinelli, Anastis     10     80  57562.360  75193.857 11918.048 50946.405  99441.308 48494.903
## ...
##       Correll, Trevon     21     97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747
##         Capelle, Adam     24     83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------

Or, work with the components individually. Use the base R names() function to identify all of the components. Component names that begin with out_ are part of the standard output. Other components include just data and statistics designed to be input into additional procedures, including R markdown documents.

names(r)
##  [1] "out_suggest"     "call"            "formula"         "out_title_bck"   "out_background"  "out_title_basic"
##  [7] "out_estimates"   "out_fit"         "out_anova"       "out_title_kfold" "out_kfold"       "out_title_rel"
## [13] "out_cor"         "out_collinear"   "out_subsets"     "out_title_res"   "out_residuals"   "out_title_pred"
## [19] "out_predict"     "out_ref"         "out_Rmd"         "out_Word"        "out_pdf"         "out_odt"
## [25] "out_rtf"         "out_plots"       "n.vars"          "n.obs"           "n.keep"          "coefficients"
## [31] "sterrs"          "tvalues"         "pvalues"         "cilb"            "ciub"            "anova_model"
## [37] "anova_residual"  "anova_total"     "se"              "resid_range"     "Rsq"             "Rsqadj"
## [43] "PRESS"           "RsqPRESS"        "m_se"            "m_MSE"           "m_Rsq"           "cor"
## [49] "tolerances"      "vif"             "resid.max"       "pred_min_max"    "residuals"       "fitted"
## [55] "cooks.distance"  "model"           "terms"

Here just display the estimates as part of the standard text output.

r$out_estimates
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825

Here display the coefficients as numeric values.

r$coefficients
## (Intercept)       Years         Pre
## 44140.97140  3251.40825   -18.26496
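
Numeric components such as the coefficients can feed further computation. As a small sketch, the code below copies the displayed coefficients, rather than reading them from r so that it runs on its own, and recomputes the fitted value and residual for Correll, Trevon, the first row of the residuals table in the full output. The variable names are illustrative.

```r
# Coefficients as displayed by r$coefficients above
b <- c("(Intercept)" = 44140.97140, Years = 3251.40825, Pre = -18.26496)

# Data values for Correll, Trevon, from the residuals table of the full output
years  <- 21
pre    <- 97
salary <- 134419.230

# Fitted value: intercept plus each slope times its predictor value
fitted_value <- b[["(Intercept)"]] + b[["Years"]] * years + b[["Pre"]] * pre

# Residual: observed value minus fitted value
residual <- salary - fitted_value

round(c(fitted = fitted_value, resid = residual), 3)
```

Both values should match the fitted and resid columns for that row, within rounding.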

Do a regression and request all prediction intervals. Then convert that output segment to a data frame named dp with base R read.table(). As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette).

This particular conversion to a data frame requires one more step. One or more spaces delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub() to remove the space after the comma before converting to a data frame.

r <- reg(Salary ~ Years, pred_rows="all", graphics=FALSE)
r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE)
dp <- read.table(text=r$out_predict)
dp[.(row.names(dp) == "Pham,Scott"),]
##            Years   Salary     pred       sf   pi.lwr   pi.upr   width
## Pham,Scott    13 81871.05 84955.08 11805.96 60962.48 108947.7 47985.2

## Contrasts

Because reg() accomplishes its computations with base R function lm(), can pass lm() parameters to reg(), which then passes their values to lm(). Here first use base R function contr.sum() to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.

cnt <- contr.sum(n=3)
cnt
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1

Now use the lm() parameter contrasts to define the effect coding for Plan, passed to reg_brief(). Contrasts only apply to factors, so first convert Plan to a factor.

d$Plan <- factor(d$Plan)
reg_brief(Salary ~ Plan, contrasts=list(Plan=cnt))

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Plan, contrasts=list(Plan=cnt), Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
## Predictor Variable: Plan
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  37
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value    Lower 95%   Upper 95%
## (Intercept) 76737.724   3897.284   19.690    0.000    68817.491   84657.958
##       Plan1 -4166.287   5113.762   -0.815    0.421   -14558.703    6226.128
##       Plan2 -6866.355   4920.990   -1.395    0.172   -16867.009    3134.299
##
##
## Standard deviation of residuals:  21456.776 for 34 degrees of freedom
##
## R-squared:  0.085    Adjusted R-squared:  0.031    PRESS R-squared:  -0.133
##
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 1.580     df: 2 and 34     p-value:  0.221
##
##
##             df           Sum Sq        Mean Sq   F-value   p-value
## Model        2   1454537623.133  727268811.566     1.580     0.221
## Residuals   34  15653370109.356  460393238.510
## Salary      36  17107907732.489  475219659.236
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##   RESIDUALS AND INFLUENCE
##
##   FORECASTING ERROR
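
To interpret effect-coded coefficients, it helps to see how they relate to the group means. The base R sketch below uses the built-in mtcars data, with the three-level factor cyl standing in for Plan, and calls lm() directly instead of reg_brief(). Under contr.sum() coding the intercept is the unweighted mean of the group means, each slope is a group's deviation from that mean, and the effect for the last level is implied by the sum-to-zero constraint.

```r
# Built-in mtcars data; cyl has three levels (4, 6, 8), like Plan in the text
dat <- data.frame(mpg = mtcars$mpg, cyl = factor(mtcars$cyl))

cnt <- contr.sum(n = 3)
fit <- lm(mpg ~ cyl, data = dat, contrasts = list(cyl = cnt))
b <- unname(coef(fit))

group_means <- tapply(dat$mpg, dat$cyl, mean)

# Effects for the first two levels are the slopes; the third effect
# is the negative of their sum
eff <- c(b[2], b[3], -(b[2] + b[3]))

# Intercept plus effect reproduces each group mean
rbind(from_model = b[1] + eff, from_data = as.vector(group_means))
```

Applied to the output above, the same constraint implies an effect for the third level of Plan of -(-4166.287 - 6866.355) = 11032.642.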

## Null Model

The $$R^2$$ fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables such that the fitted value for each set of X values is the mean of the response variable $$y$$. The corresponding intercept is the mean of $$y$$, and the standard deviation of the residuals is the standard deviation of $$y$$.

The following submits the null model for Salary, and plots the errors. Compare the standard deviation of the residuals to a regression model of Salary with one or more predictor variables.

reg_brief(Salary ~ 1, plot_errors=TRUE)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ 1, plot_errors=TRUE, Rmd="eg")
##
##
##   BACKGROUND
##
## Data Frame:  d
##
## Response Variable: Salary
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  37
##
##
##   BASIC ANALYSIS
##
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 73795.557   3583.821   20.591    0.000   66527.230   81063.883
##
##
## Standard deviation of residuals:  21799.533 for 36 degrees of freedom
##
##
##             df           Sum Sq        Mean Sq   F-value   p-value
## Residuals   36  17107907732.489  475219659.236
##
##
##   K-FOLD CROSS-VALIDATION
##
##   RELATIONS AMONG THE VARIABLES
##
##   RESIDUALS AND INFLUENCE
##
##   FORECASTING ERROR
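
These properties of the null model can be checked directly with base R. The sketch below assumes the built-in mtcars data so that it runs without lessR, and then uses the null model as the baseline for the $$R^2$$ of a one-predictor model.

```r
y <- mtcars$mpg  # built-in data, so the sketch runs without lessR

# Null model: intercept only, no predictor variables
fit_null <- lm(y ~ 1)

# The only coefficient is the intercept, which equals the mean of y
c(intercept = unname(coef(fit_null)), mean_y = mean(y))

# The residual standard deviation equals the standard deviation of y
c(sd_resid = sigma(fit_null), sd_y = sd(y))

# R-squared of a model with a predictor, from the two sums of squared errors
fit_x <- lm(mpg ~ wt, data = mtcars)
sse_null <- sum(residuals(fit_null)^2)
sse_x <- sum(residuals(fit_x)^2)
rsq <- 1 - sse_x / sse_null
rsq
```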

Also obtain the null model output from the lessR function Plot() with the fit parameter set to "null".

## Interpreted Output

The Rmd parameter automatically generates an R markdown file and then the corresponding html document, knitting the various output components together with full interpretation. The result is a new, much more complete form of computer output.

Not run here.

reg(Salary ~ Years + Pre, Rmd="eg")

## Logistic Regression

Specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit(). The output includes the confusion matrix and various classification fit indices.

### Default Analysis

Logit(Gender ~ Salary)
##
## Response Variable:   Gender
## Predictor Variable 1:  Salary
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  37
##
##
##
##    BASIC ANALYSIS
##
## Model Coefficients
##
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)   -2.6191     1.3715   -1.910    0.056     -5.3073      0.0691
##      Salary    0.0000     0.0000    1.904    0.057     -0.0000      0.0001
##
##
## Odds ratios and confidence intervals
##
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.0729      0.0050      1.0715
##      Salary      1.0000      1.0000      1.0001
##
##
## Model Fit
##
##     Null deviance: 51.266 on 36 degrees of freedom
## Residual deviance: 46.918 on 35 degrees of freedom
##
## AIC: 50.91807
##
## Number of iterations to convergence: 4
##
##
##
##
##    ANALYSIS OF RESIDUALS AND INFLUENCE
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20 out of 37 cases (rows) of data]
## --------------------------------------------------------------------
##                     Salary Gender fitted residual rstudent  dffits   cooks
## James, Leslie       122563      F 0.8424  -0.8424  -2.1213 -0.7143 0.46299
## Langston, Matthew    49189      M 0.2900   0.7100   1.6237  0.3646 0.08559
## Osterman, Pascal     49705      M 0.2938   0.7062   1.6139  0.3586 0.08225
## Kralik, Laura        92681      F 0.6522  -0.6522  -1.4942 -0.3313 0.06402
## Ritchie, Darnell     53788      M 0.3243   0.6757   1.5380  0.3136 0.05962
## Skrotzki, Sara       91352      F 0.6416  -0.6416  -1.4698 -0.3161 0.05736
## Cassinelli, Anastis  57562      M 0.3539   0.6461   1.4703  0.2761 0.04409
## Link, Thomas         66313      M 0.4267   0.5733   1.3223  0.2111 0.02335
## Anderson, David      69548      M 0.4547   0.5453   1.2706  0.1967 0.01962
## Stanley, Grayson     69625      M 0.4553   0.5447   1.2694  0.1965 0.01955
## Capelle, Adam       108138      M 0.7632   0.2368   0.7586  0.2236 0.01954
## Knox, Michael        99063      M 0.7011   0.2989   0.8637  0.2179 0.01935
## Hoang, Binh         111075      M 0.7813   0.2187   0.7265  0.2228 0.01919
## Sheppard, Cory       95028      M 0.6706   0.3294   0.9132  0.2119 0.01869
## Wu, James            94495      M 0.6665   0.3335   0.9199  0.2110 0.01859
## Campagna, Justin     72321      M 0.4788   0.5212   1.2275  0.1888 0.01759
## Fulton, Scott        87786      M 0.6124   0.3876   1.0066  0.1980 0.01706
## Adib, Hassan         83014      M 0.5720   0.4280   1.0715  0.1892 0.01613
## Pham, Scott          81871      M 0.5622   0.4378   1.0875  0.1875 0.01599
## Portlock, Ryan       77715      M 0.5261   0.4739   1.1469  0.1841 0.01593
##
##
##    FORECASTS
##
## Probability threshold for predicting M: 0.5
##
##  0: F
##  1: M
##
## Data, Fitted Values, Standard Errors
##    [sorted by fitted value]
## --------------------------------------------------------------------
##                     Salary Gender predict fitted std.err
## Stanley, Emma        46125      F       0 0.2684  0.1161
## Langston, Matthew    49189      M       0 0.2900  0.1126
## Osterman, Pascal     49705      M       0 0.2938  0.1119
## Gvakharia, Kimberly  49869      F       0 0.2949  0.1117
##
## ... for the rows of data where fitted is close to 0.5 ...
##
##                    Salary Gender predict fitted std.err
## Campagna, Justin    72321      M       0 0.4788 0.08710
## Korhalkar, Jessica  72502      F       0 0.4804 0.08713
## Billing, Susan      72675      F       0 0.4819 0.08718
## Portlock, Ryan      77715      M       1 0.5261 0.09079
## Pham, Scott         81871      M       1 0.5622 0.09670
##
## ... for the last 4 rows of sorted data ...
##
##                 Salary Gender predict fitted std.err
## Capelle, Adam   108138      M       1 0.7632  0.1355
## Hoang, Binh     111075      M       1 0.7813  0.1364
## James, Leslie   122563      F       1 0.8424  0.1318
## Correll, Trevon 134419      M       1 0.8901  0.1174
## --------------------------------------------------------------------
##
##
## ----------------------------
## Specified confusion matrices
## ----------------------------
##
## Probability threshold for predicting M: 0.5
##
##                Baseline         Predicted
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct
## ---------------------------------------------------
##          0       19  51.4       16      3     84.2
## Gender   1       18  48.6        8     10     55.6
## ---------------------------------------------------
##        Total     37                           70.3
##
## Accuracy: 70.27
## Recall: 55.56
## Precision: 76.92

### Change Classification Threshold

Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut parameter.

Logit(Gender ~ Years + Salary, prob_cut=c(.3, .5, .7))
##
## Response Variable:   Gender
## Predictor Variable 1:  Years
## Predictor Variable 2:  Salary
##
## Number of cases (rows) of data:  37
## Number of cases retained for analysis:  36
##
##
##
##    BASIC ANALYSIS
##
## Model Coefficients
##
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)   -0.1922     1.5968   -0.120    0.904     -3.3218      2.9374
##       Years    0.4041     0.1801    2.244    0.025      0.0511      0.7571
##      Salary   -0.0001     0.0000   -1.317    0.188     -0.0001      0.0000
##
##
## Odds ratios and confidence intervals
##
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.8251      0.0361     18.8658
##       Years      1.4979      1.0524      2.1320
##      Salary      0.9999      0.9999      1.0000
##
##
## Model Fit
##
##     Null deviance: 49.795 on 35 degrees of freedom
## Residual deviance: 38.772 on 33 degrees of freedom
##
## AIC: 44.77244
##
## Number of iterations to convergence: 4
##
##
## Collinearity
##
##        Tolerance       VIF
## Years      0.274     3.655
## Salary     0.274     3.655
##
##
##
##    ANALYSIS OF RESIDUALS AND INFLUENCE
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20 out of 36 cases (rows) of data]
## --------------------------------------------------------------------
##                     Years Salary Gender fitted residual rstudent  dffits   cooks
## Skrotzki, Sara         18  91352      F 0.9174  -0.9174  -2.4427 -0.6774 0.35497
## James, Leslie          18 122563      F 0.6923  -0.6923  -1.7374 -0.8959 0.28507
## Hoang, Binh            15 111075      M 0.5465   0.4535   1.1867  0.5568 0.08266
## LaRoe, Maria           10  61961      F 0.6635  -0.6635  -1.5384 -0.4480 0.06884
## Langston, Matthew       5  49189      M 0.3344   0.6656   1.5340  0.4122 0.05852
## Osterman, Pascal        5  49705      M 0.3286   0.6714   1.5451  0.4083 0.05799
## Saechao, Suzanne        8  55545      F 0.5495  -0.5495  -1.3072 -0.3747 0.04143
## Bellingar, Samantha    10  66338      F 0.6118  -0.6118  -1.4120 -0.3359 0.03598
## Campagna, Justin        8  72321      M 0.3409   0.6591   1.5015  0.3235 0.03574
## Ritchie, Darnell        7  53788      M 0.4712   0.5288   1.2641  0.3408 0.03351
## Correll, Trevon        21 134419      M 0.8048   0.1952   0.7135  0.3834 0.03266
## Kralik, Laura          10  92681      F 0.2905  -0.2905  -0.8663 -0.3233 0.02467
## Cassinelli, Anastis    10  57562      M 0.7117   0.2883   0.8576  0.2983 0.02098
## Kimball, Claire         8  61357      F 0.4754  -0.4754  -1.1573 -0.2510 0.01725
## Downs, Deborah          7  57140      F 0.4288  -0.4288  -1.0800 -0.2502 0.01643
## Sheppard, Cory         14  95028      M 0.6464   0.3536   0.9581  0.2575 0.01638
## Stanley, Grayson        9  69625      M 0.4707   0.5293   1.2459  0.2329 0.01568
## Anderson, David         9  69548      M 0.4717   0.5283   1.2442  0.2327 0.01564
## Link, Thomas           10  66313      M 0.6121   0.3879   1.0112  0.2391 0.01450
## Fulton, Scott          13  87786      M 0.6387   0.3613   0.9649  0.2213 0.01217
##
##
##    FORECASTS
##
## Probability threshold for predicting M: 0.5
##
##  0: F
##  1: M
##
## Data, Fitted Values, Standard Errors
##    [sorted by fitted value]
## --------------------------------------------------------------------
##                    Years Salary Gender predict  fitted std.err
## Korhalkar, Jessica     2  72502      F       0 0.04339 0.05839
## Singh, Niral           2  61055      F       0 0.07533 0.07374
## Hamide, Bita           1  51037      F       0 0.08324 0.07439
## Billing, Susan         4  72675      F       0 0.09163 0.08914
##
## ... for the rows of data where fitted is close to 0.5 ...
##
##                  Years Salary Gender predict fitted std.err
## Ritchie, Darnell     7  53788      M       0 0.4712 0.13798
## Anderson, David      9  69548      M       0 0.4717 0.09822
## Kimball, Claire      8  61357      F       0 0.4754 0.11321
## Hoang, Binh         15 111075      M       1 0.5465 0.21933
## Saechao, Suzanne     8  55545      F       1 0.5495 0.14527
##
## ... for the last 4 rows of sorted data ...
##
##                 Years Salary Gender predict fitted std.err
## Correll, Trevon    21 134419      M       1 0.8048 0.19249
## Knox, Michael      18  99063      M       1 0.8822 0.09262
## Skrotzki, Sara     18  91352      F       1 0.9174 0.07830
## Capelle, Adam      24 108138      M       1 0.9815 0.02894
## --------------------------------------------------------------------
##
##
## ----------------------------
## Specified confusion matrices
## ----------------------------
##
## Probability threshold for predicting M: 0.3
##
##                Baseline         Predicted
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct
## ---------------------------------------------------
##          0       19  52.8       14      5     73.7
## Gender   1       17  47.2        6     11     64.7
## ---------------------------------------------------
##        Total     36                           69.4
##
## Accuracy: 69.44
## Recall: 64.71
## Precision: 68.75
##
##
##
## Probability threshold for predicting M: 0.5
##
##                Baseline         Predicted
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct
## ---------------------------------------------------
##          0       19  52.8       14      5     73.7
## Gender   1       17  47.2        6     11     64.7
## ---------------------------------------------------
##        Total     36                           69.4
##
## Accuracy: 69.44
## Recall: 64.71
## Precision: 68.75
##
##
##
## Probability threshold for predicting M: 0.7
##
##                Baseline         Predicted
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct
## ---------------------------------------------------
##          0       19  52.8       14      5     73.7
## Gender   1       17  47.2        6     11     64.7
## ---------------------------------------------------
##        Total     36                           69.4
##
## Accuracy: 69.44
## Recall: 64.71
## Precision: 68.75
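
Several figures in the output above can be reproduced by hand, which helps show how they are defined. The following base R sketch is not part of lessR. It copies the printed (rounded) values from the output, so the results agree with the reported ones only up to rounding. The variable names (`est`, `lower`, `upper`, `cm`) are chosen for illustration. Salary is left out of the odds-ratio check because its printed coefficient, -0.0001, is rounded too coarsely to be useful.

```r
# Coefficients and 95% bounds copied from "Model Coefficients" (rounded as printed)
est   <- c("(Intercept)" = -0.1922, Years = 0.4041)
lower <- c("(Intercept)" = -3.3218, Years = 0.0511)
upper <- c("(Intercept)" =  2.9374, Years = 0.7571)

# An odds ratio is the exponentiated logit coefficient;
# the same transformation applies to the confidence bounds
odds <- exp(cbind(OddsRatio = est, Lower = lower, Upper = upper))
print(round(odds, 3))

# For a binary (0/1) response, AIC = residual deviance + 2 * (number of coefficients)
resid_dev <- 38.772
n_coef    <- 3            # intercept, Years, Salary
aic       <- resid_dev + 2 * n_coef
print(aic)

# Confusion matrix counts copied from the output: rows = actual, columns = predicted
cm <- matrix(c(14,  5,
                6, 11),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)          # correct predictions / all cases
recall    <- cm["1", "1"] / sum(cm["1", ])    # true positives / actual positives
precision <- cm["1", "1"] / sum(cm[, "1"])    # true positives / predicted positives

print(round(100 * c(Accuracy = accuracy, Recall = recall, Precision = precision), 2))
```

The percentages in the last line match the Accuracy, Recall, and Precision rows printed under each confusion matrix.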

## Full Manual

Use the base R help() function to view the full manual for Regression(). As a shortcut, enter a question mark followed by the name of the function or its abbreviation.

?reg
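
A few equivalent ways to reach the same manual page, sketched here on the assumption that lessR is installed and attached. All of these use base R's standard help facilities, not anything specific to lessR.

```r
library("lessR")

?Regression                             # full function name instead of the abbreviation
help("reg")                             # help() called directly, topic given as a string
help("Regression", package = "lessR")   # restrict the search to the lessR package
```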