Demonstration of latrend package

Niek Den Teuling

2021-04-14

This vignette describes the core functionality of the package by identifying common trends in the longitudinal dataset that is included with the package. We begin by loading the required package and the latrendData dataset.

library(latrend)
data(latrendData)

Initial data exploration

latrendData is a synthetic dataset for which the reference group of each trajectory is available, as indicated by the Class column. We will use this column at the end of this vignette to validate the identified model.

head(latrendData)
#>   Id      Time           Y   Class
#> 1  1 0.0000000 -1.08049205 Class 1
#> 2  1 0.2222222 -0.68024151 Class 1
#> 3  1 0.4444444 -0.65148373 Class 1
#> 4  1 0.6666667 -0.39115398 Class 1
#> 5  1 0.8888889 -0.19407876 Class 1
#> 6  1 1.1111111 -0.02991783 Class 1

Many of the functions of the package require the specification of the trajectory identifier variable (named Id) and the time variable (named Time). For convenience, we specify these variables as package options.

options(latrend.id = "Id", latrend.time = "Time")

Prior to attempting to model the data, it is worthwhile to visually inspect it.

plotTrajectories(latrendData, response = "Y")

Visualizing the trajectories of the latrend dataset.

The presence of clusters is not apparent from the plot. With any longitudinal analysis, one should first consider whether clustering brings any benefit over representing the heterogeneity of the data with a single common trend, or with a multilevel model. In this demonstration, we omit this step, as we know beforehand that the data was generated via distinct mechanisms.
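
If this check is desired, a minimal approach is to estimate a single common-trend baseline and compare it against multi-cluster candidates using an information criterion. A sketch using the latrend functions introduced later in this vignette, with a single-cluster KML fit serving as the baseline:

# Sketch: single-cluster KML fit as a common-trend baseline.
# Its BIC can be compared against the multi-cluster solutions fitted below.
baselineMethod <- lcMethodKML(response = "Y", nClusters = 1, nbRedrawing = 1)
baselineModel <- latrend(baselineMethod, data = latrendData)
BIC(baselineModel)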

Non-parametric trajectory clustering

Assuming the appropriate (cluster) trajectory model is not known in advance, non-parametric longitudinal cluster models can provide a suitable starting point.

As an example, we apply longitudinal \(k\)-means (KML). First, we need to define the method. At the very least, we need to indicate the response variable the method should operate on. Second, we should indicate how many clusters we expect. We do not need to specify the id and time arguments, as we have already set these as package options. We use the nbRedrawing argument provided by the kml package to reduce the number of repeated random starts to a single model estimation, in order to limit the run-time of this example.

kmlMethod <- lcMethodKML(response = "Y", nClusters = 2, nbRedrawing = 1)

kmlMethod
#> lcMethodKML as "longitudinal k-means (KML)"
#>  nbRedrawing:    1
#>  maxIt:          200
#>  imputationMethod:"copyMean"
#>  distanceName:   "euclidean"
#>  power:          2
#>  distance:       function() {}
#>  centerMethod:   meanNA
#>  startingCond:   "nearlyAll"
#>  nbCriterion:    1000
#>  scale:          TRUE
#>  response:       "Y"
#>  time:           getOption("latrend.time")
#>  id:             getOption("latrend.id")
#>  nClusters:      2

As seen in the output from the lcMethodKML object, the KML method is defined by additional arguments. These are specific to the kml package.
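
Any of these arguments can be overridden when specifying the method. As a brief sketch, assuming the update() method defined for lcMethod objects, a variant with more repeated random starts could be specified via:

# Sketch: derive a variant of the method specification with more random starts
# (illustrative value; increases run-time).
kmlMethodMoreStarts <- update(kmlMethod, nbRedrawing = 5)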

The KML model is estimated on the dataset via the latrend function.

kmlModel <- latrend(kmlMethod, data = latrendData)
#>  ~ Fast KmL ~
#> *

Now that we have fitted the KML model with 2 clusters, we can print a summary by calling:

kmlModel
#> Longitudinal cluster model using longitudinal k-means (KML)
#>  nClusters:      2
#>  id:             "Id"
#>  time:           "Time"
#>  response:       "Y"
#>  scale:          TRUE
#>  nbCriterion:    1000
#>  startingCond:   "nearlyAll"
#>  centerMethod:   function (x) {    mean(x, na.rm = TRUE)}
#>  distance:       function () {}
#>  power:          2
#>  distanceName:   "euclidean"
#>  imputationMethod:"copyMean"
#>  maxIt:          200
#>  nbRedrawing:    1
#> 
#> Cluster sizes (K=2):
#>         A         B 
#> 110 (55%)  90 (45%) 
#> 
#> Number of obs: 2000, strata (Id): 200
#> 
#> Scaled residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -3.83836 -0.60818  0.02327  0.00000  0.64356  3.49294
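
Apart from the printed summary, the fitted model can be queried directly. A brief sketch using standard latrend accessor functions:

# Number of clusters, relative cluster sizes, and the cluster membership
# assigned to each trajectory.
nClusters(kmlModel)
clusterProportions(kmlModel)
head(trajectoryAssignments(kmlModel))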

Identifying the number of clusters

As we do not know the best number of clusters needed to represent the data, we should consider fitting the KML model for a range of clusters. We can then select the best representation by comparing the solutions by one or more cluster metrics.

We can specify a range of lcMethodKML method specifications based on a prototype method using the lcMethods function. This function returns a list of lcMethod objects. A structured summary is obtained by calling as.data.frame.

kmlMethods <- lcMethods(kmlMethod, nClusters = 1:8)

as.data.frame(kmlMethods)
#>        .class nClusters                      id                      time
#> 1 lcMethodKML         1 getOption("latrend.id") getOption("latrend.time")
#> 2 lcMethodKML         2 getOption("latrend.id") getOption("latrend.time")
#> 3 lcMethodKML         3 getOption("latrend.id") getOption("latrend.time")
#> 4 lcMethodKML         4 getOption("latrend.id") getOption("latrend.time")
#> 5 lcMethodKML         5 getOption("latrend.id") getOption("latrend.time")
#> 6 lcMethodKML         6 getOption("latrend.id") getOption("latrend.time")
#> 7 lcMethodKML         7 getOption("latrend.id") getOption("latrend.time")
#> 8 lcMethodKML         8 getOption("latrend.id") getOption("latrend.time")
#>   response scale nbCriterion startingCond centerMethod      distance power
#> 1        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 2        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 3        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 4        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 5        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 6        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 7        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#> 8        Y  TRUE        1000    nearlyAll       meanNA function() {}     2
#>   distanceName imputationMethod maxIt nbRedrawing
#> 1    euclidean         copyMean   200           1
#> 2    euclidean         copyMean   200           1
#> 3    euclidean         copyMean   200           1
#> 4    euclidean         copyMean   200           1
#> 5    euclidean         copyMean   200           1
#> 6    euclidean         copyMean   200           1
#> 7    euclidean         copyMean   200           1
#> 8    euclidean         copyMean   200           1

The list of lcMethod objects can be fitted using the latrendBatch function, returning a list of lcModel objects.

kmlModels <- latrendBatch(kmlMethods, data = latrendData, verbose = FALSE)

kmlModels
#> List of 8 lcModels with
#>   .name .method nClusters
#> 1     1     kml         1
#> 2     2     kml         2
#> 3     3     kml         3
#> 4     4     kml         4
#> 5     5     kml         5
#> 6     6     kml         6
#> 7     7     kml         7
#> 8     8     kml         8

We can compare each of the solutions via one or more cluster metrics. Considering the consistent improvements achieved by KML for an increasing number of clusters, identifying the best solution by minimizing a metric would lead to an overestimation of the number of clusters. Instead, we perform the selection via a manual elbow method, using the plotMetric function.

plotMetric(kmlModels, c("logLik", "BIC", "WMAE"))

Elbow plots of three relevant cluster metrics across the fitted models.
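
The metric values underlying the plot can also be tabulated, for example by evaluating each model in the list of fitted models:

# Sketch: tabulate the number of clusters and BIC of each fitted model.
data.frame(
  nClusters = sapply(kmlModels, nClusters),
  BIC = sapply(kmlModels, BIC)
)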

Investigating the preferred model

We have selected the 4-cluster model as the preferred representation. We will now inspect this solution in more detail. Before we can start, we first obtain the fitted lcModel object from the list of fitted models.

kmlModel4 <- subset(kmlModels, nClusters == 4, drop = TRUE)

kmlModel4
#> Longitudinal cluster model using longitudinal k-means (KML)
#>  nbRedrawing:    1
#>  maxIt:          200
#>  imputationMethod:"copyMean"
#>  distanceName:   "euclidean"
#>  power:          2
#>  distance:       function () {}
#>  centerMethod:   function (x) {    mean(x, na.rm = TRUE)}
#>  startingCond:   "nearlyAll"
#>  nbCriterion:    1000
#>  scale:          TRUE
#>  response:       "Y"
#>  time:           "Time"
#>  id:             "Id"
#>  nClusters:      4
#> 
#> Cluster sizes (K=4):
#>          A          B          C          D 
#>  100 (50%) 54 (27.3%) 25 (12.4%) 21 (10.3%) 
#> 
#> Number of obs: 2000, strata (Id): 200
#> 
#> Scaled residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -4.58872 -0.60854  0.04235  0.00000  0.65970  3.23364
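
The estimated cluster trajectories can also be extracted as a data.frame for further processing or custom plotting, assuming the clusterTrajectories function:

# Sketch: cluster trajectory values per cluster and observation time.
head(clusterTrajectories(kmlModel4))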

The plotClusterTrajectories function shows the estimated cluster trajectories of the model.

plotClusterTrajectories(kmlModel4)

Cluster trajectories for KML model with 4 clusters.

We can get a better sense of the representation of the cluster trajectories when plotted against the trajectories that have been assigned to the respective cluster.

plot(kmlModel4)

Cluster trajectories for KML model with 4 clusters, along with the assigned trajectories.

Model adequacy

The list of currently supported internal model metrics can be obtained by calling the getInternalMetricNames function.

getInternalMetricNames()
#>  [1] "AIC"             "APPA"            "BIC"             "MAE"            
#>  [5] "MSE"             "RSS"             "WMAE"            "WMSE"           
#>  [9] "WRSS"            "converged"       "deviance"        "entropy"        
#> [13] "estimationTime"  "logLik"          "relativeEntropy" "sigma"

As an example, we will compute the APPA (a measure of cluster separation), and the WRSS and WMAE metrics (measures of model error).

metric(kmlModel, c("APPA", "WRSS", "WMAE"))
#>        APPA        WRSS        WMAE 
#>   0.9996480 167.8058270   0.1123596

The quantile-quantile (QQ) plot can be used to assess the model for structural deviations.

qqPlot(kmlModel4)
#> Loading required namespace: qqplotr

QQ-plot of the selected KML model.

Overall, the unexplained errors closely follow a normal distribution. In situations where structural deviations from the expected distribution are apparent, it may be fruitful to investigate the QQ plot on a per-cluster basis.

qqPlot(kmlModel4, byCluster = TRUE, detrend = TRUE)

Cluster-specific detrended QQ-plot for the selected KML model.

Parametric trajectory clustering

Group-based trajectory modeling

The KML analysis has provided us with clues on an appropriate model for the cluster trajectories. We can use these insights to define a parametric group-based trajectory model (GBTM). We use the GBTM implementation available from the lcmm package, representing the cluster trajectories by a B-spline of degree 3 from the splines package.

library(splines)
gbtmMethod <- lcMethodLcmmGBTM(fixed = Y ~ bs(Time), mixture = fixed)

gbtmMethod
#> lcMethodLcmmGBTM as "group-based trajectory modeling using lcmm"
#>  idiag:          FALSE
#>  nwg:            FALSE
#>  cor:            NULL
#>  convB:          1e-04
#>  convL:          1e-04
#>  convG:          1e-04
#>  maxiter:        500
#>  subset:         NULL
#>  na.action:      1
#>  posfix:         NULL
#>  fixed:          Y ~ bs(Time)
#>  mixture:        fixed
#>  classmb:        ~1
#>  time:           getOption("latrend.time")
#>  id:             getOption("latrend.id")
#>  nClusters:      2

We fit the GBTM for 1 to 5 clusters.

gbtmMethods <- lcMethods(gbtmMethod, nClusters = 1:5)

gbtmModels <- latrendBatch(gbtmMethods, data = latrendData, verbose = FALSE)
plotMetric(gbtmModels, c("logLik", "BIC", "WMAE"))

Three cluster metrics for each of the GBTMs.

All metrics clearly point to the 3-cluster solution.

bestGbtmModel <- subset(gbtmModels, nClusters == 3, drop = TRUE)
plot(bestGbtmModel)

Growth mixture modeling

We have identified 3 apparent clusters via GBTM. However, we may wish to obtain a model that better explains the within-cluster heterogeneity (as seen in the cluster trajectories plot above). We apply a growth mixture model (GMM), with cluster trajectories represented by a second-order polynomial.

gmmMethod <- lcMethodLcmmGMM(fixed = Y ~ poly(Time, 2, raw = TRUE), mixture = fixed, idiag = TRUE)

gmmMethod
#> lcMethodLcmmGMM as "growth mixture model"
#>  idiag:          TRUE
#>  nwg:            FALSE
#>  cor:            NULL
#>  convB:          1e-04
#>  convL:          1e-04
#>  convG:          1e-04
#>  maxiter:        500
#>  na.action:      1
#>  posfix:         NULL
#>  fixed:          Y ~ poly(Time, 2, raw = TRUE)
#>  mixture:        fixed
#>  random:         ~1
#>  classmb:        ~1
#>  time:           getOption("latrend.time")
#>  id:             getOption("latrend.id")
#>  nClusters:      2

We fit the GMM for 1 to 5 clusters.

gmmMethods <- lcMethods(gmmMethod, nClusters = 1:5)

gmmModels <- latrendBatch(gmmMethods, latrendData, verbose = FALSE)
plotMetric(gmmModels, c("logLik", "BIC", "WMAE"))

Three cluster metrics for each of the GMMs.

Similar to the GBTM analysis, the metrics clearly point to the 3-cluster solution.

bestGmmModel <- subset(gmmModels, nClusters == 3, drop = TRUE)
plot(bestGmmModel)

Cluster trajectories of the selected GMM, including the assigned trajectories.

qqPlot(bestGmmModel, detrend = TRUE)

Detrended QQ plot for the best GMM.

As can be seen from the scale of the y-axis of the QQ plot, the deviations from the expected normal distribution are negligibly small.

Comparing models

Both the GBTM and GMM analyses indicated that the data is best represented using 3 clusters. We are now interested in identifying which of the two models provides the better representation. After all, despite the matching number of clusters, the models may differ in terms of trajectory assignments and the shape of the cluster trajectories.

Firstly, we compare the fit to the data via the BIC, which indicates that the GMM is a considerably better fit than the GBTM.

BIC(bestGbtmModel, bestGmmModel)
#>               df        BIC
#> bestGbtmModel 15   574.0813
#> bestGmmModel  13 -1672.9493

Secondly, we compare the model fit errors, which show a lower weighted error for the GMM.

metric(list(bestGbtmModel, bestGmmModel), 'WMAE')
#> [1] 0.06408468 0.02951830

Thirdly, we can compare the weighted minimum MAE between the cluster trajectories of the models, as a measure of agreement in cluster trajectory shapes.

externalMetric(bestGbtmModel, bestGmmModel, 'WMMAE')
#>      WMMAE 
#> 0.05559309

Lastly, we evaluate the agreement in trajectory assignments to the clusters using the adjusted Rand index (ARI). This intuitive index considers the co-assignments of trajectories, correcting for agreement between the partitionings that arises by chance. A score of zero indicates a match no better than chance, whereas a score of 1 indicates perfect agreement.
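
Expressed in terms of the Rand index (RI), the adjustment rescales the index by its value expected under random assignment:

\[ \mathrm{ARI} = \frac{\mathrm{RI} - \operatorname{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \operatorname{E}[\mathrm{RI}]} \]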

externalMetric(bestGbtmModel, bestGmmModel, 'adjustedRand')
#> adjustedRand 
#>    0.4770003

The ARI of 0.48 indicates only moderate agreement between the cluster assignments of the two models.

Comparison to the reference

Since we have established a preferred clustered representation of the data heterogeneity, we can now compare the resulting cluster assignments to the ground truth from which the latrendData dataset was generated.

Using the reference assignments, we can also plot a non-parametric estimate of the cluster trajectories. Note how it looks similar to the cluster trajectories found by our model.

plotClusterTrajectories(latrendData, response = "Y", cluster = "Class")

Non-parametric estimates of the cluster trajectories based on the reference assignments.

In order to compare the reference assignments to the trajectory assignments generated by our model, we can create a lcModel object based on the reference assignments using the lcModelPartition function.

refTrajAssigns <- aggregate(Class ~ Id, data = latrendData, FUN = data.table::first)
refModel <- lcModelPartition(data = latrendData, response = "Y", trajectoryAssignments = refTrajAssigns$Class)
refModel
#> Longitudinal cluster model using part
#>  no arguments
#> 
#> Cluster sizes (K=3):
#>        A        B        C 
#> 90 (45%) 80 (40%) 30 (15%) 
#> 
#> Number of obs: 2000, strata (Id): 200
#> 
#> Scaled residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -3.33030 -0.59503 -0.02617  0.00000  0.59606  3.40645
plot(refModel)

Cluster trajectories of the reference model.

By constructing a reference model, we can make use of the standardized way in which lcModel objects can be compared. A list of supported comparison metrics can be obtained via the getExternalMetricNames function.

getExternalMetricNames()
#>  [1] "CohensKappa"    "F"              "F1"             "FolkesMallows" 
#>  [5] "Hubert"         "Jaccard"        "Kulczynski"     "MI"            
#>  [9] "MaximumMatch"   "McNemar"        "MeilaHeckerman" "Mirkin"        
#> [13] "NMI"            "NSJ"            "NVI"            "Overlap"       
#> [17] "PD"             "Phi"            "Rand"           "RogersTanimoto"
#> [21] "RusselRao"      "SMC"            "SokalSneath1"   "SokalSneath2"  
#> [25] "VI"             "WMMAE"          "WMMAE_ref"      "WMMSE"         
#> [29] "WMMSE_ref"      "WMSSE"          "WMSSE_ref"      "Wallace1"      
#> [33] "Wallace2"       "adjustedRand"   "jointEntropy"   "precision"     
#> [37] "recall"         "splitJoin"      "splitJoin_ref"
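
Any of these metrics can be computed between two models in the same way as before, for example the Jaccard index or the normalized mutual information (NMI):

externalMetric(bestGmmModel, refModel, "Jaccard")
externalMetric(bestGmmModel, refModel, "NMI")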

Lastly, we compare the agreement in trajectory assignments via the adjusted Rand index.

externalMetric(bestGmmModel, refModel, "adjustedRand")
#> adjustedRand 
#>    0.9883837

With an adjusted Rand index of 0.99, the trajectory assignments of our model are a near-perfect match with the reference. This result is expected, as the dataset was generated using a growth mixture model.
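
For a more detailed view of the agreement, the two sets of trajectory assignments can be cross-tabulated. A sketch, assuming the trajectoryAssignments accessor used earlier:

# Sketch: confusion matrix of model cluster assignments versus the reference.
table(
  GMM = trajectoryAssignments(bestGmmModel),
  Reference = trajectoryAssignments(refModel)
)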