Introduction to LUCID (Latent Unknown Clustering with Integrated Data)
Multi-omics data and the phenotypic trait are integrated by jointly
modeling their relationships through a latent cluster variable, as
illustrated by the directed acyclic graph (DAG) below (a figure taken
from the LUCID paper).

Let \(\mathbf{G}\) be an \(n \times p\) matrix whose columns
represent genetic features/environmental exposures and whose rows are
the observations; let \(\mathbf{Z}\) be an
\(n \times m\) matrix of standardized
biomarkers and \(\mathbf{Y}\) an
\(n\)-dimensional vector of the disease
outcome. Based on the DAG, we further assume that the three
components above are linked by a categorical latent cluster variable
\(\mathbf{X}\) with \(K\) classes. With the conditional
independence implied by the DAG, the joint log-likelihood of the
LUCID model can be written as \[\begin{equation}
\begin{aligned}
l(\mathbf{\Theta}) & = \sum_{i = 1}^n\log f(\mathbf{Z}_i,
Y_i|\mathbf{G_i}; \mathbf{\Theta}) \\
& = \sum_{i = 1}^n \log \sum_{j = 1}^K f(\mathbf{Z}_i|X_i =
j; \mathbf{\Theta}_j) f(Y_i|X_i = j; \mathbf{\Theta}_j) f(X_i =
j|\mathbf{G}_i; \mathbf{\Theta}_j)
\end{aligned}
\end{equation}\] where \(\mathbf{\Theta}\) is generic notation for
the parameters associated with each probability model.
Additionally, we assume \(\mathbf{X}\)
follows a multinomial distribution conditional on \(\mathbf{G}\), \(\mathbf{Z}\) follows a multivariate normal
distribution conditional on \(\mathbf{X}\), and \(\mathbf{Y}\) follows a normal or Bernoulli
distribution (depending on whether the disease outcome is continuous or binary)
conditional on \(\mathbf{X}\). Therefore, the likelihood above
can be written as \[\begin{equation}
\begin{aligned}
l(\mathbf{\Theta}) = \sum_{i = 1}^n \log \sum_{j = 1}^K
S(\mathbf{G}_i; \boldsymbol{\beta}_j) \phi(\mathbf{Z}_i;
\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)f(Y_i;\mathbf{\Theta}_j)
\end{aligned}
\end{equation}\] where \(S\)
denotes the softmax function and \(\phi\) denotes the probability density
function (pdf) of the multivariate normal distribution.
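To make the notation concrete, the following toy sketch (purely illustrative, with made-up parameter values; it is not the internal implementation of LUCIDus) evaluates this mixture log-likelihood for a small simulated data set, using mvtnorm::dmvnorm for \(\phi\) and a normal outcome model for \(f\).
# toy sketch of the observed-data log-likelihood above (illustrative only)
library(mvtnorm)                      # for dmvnorm()
set.seed(1)
n <- 5; K <- 2; p <- 2; m <- 3        # small made-up dimensions
G_toy <- matrix(rnorm(n * p), n, p)
Z_toy <- matrix(rnorm(n * m), n, m)
Y_toy <- rnorm(n)
beta  <- rbind(c(0, 0, 0),            # cluster 1 (reference): intercept + p slopes
               c(0.5, 1, -1))         # cluster 2
mu    <- list(rep(-1, m), rep(1, m))  # cluster-specific means of Z
Sigma <- list(diag(m), diag(m))       # cluster-specific covariances of Z
gamma <- c(-0.5, 0.5); sd_y <- 1      # normal outcome model per cluster
# S(G_i; beta_j): softmax of the linear predictors over the K clusters
eta <- cbind(1, G_toy) %*% t(beta)
S   <- exp(eta) / rowSums(exp(eta))
# joint_ij = S(G_i; beta_j) * phi(Z_i; mu_j, Sigma_j) * f(Y_i; Theta_j)
joint <- sapply(1:K, function(j) {
  S[, j] * dmvnorm(Z_toy, mean = mu[[j]], sigma = Sigma[[j]]) *
    dnorm(Y_toy, mean = gamma[j], sd = sd_y)
})
sum(log(rowSums(joint)))              # l(Theta) = sum_i log sum_j joint_ij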
To obtain the maximum likelihood estimates (MLE) of the model
parameters, an EM algorithm is applied to handle the latent variable
\(\mathbf{X}\). Denoting the observed
data by \(\mathbf{D}\), the
posterior probability that observation \(i\) is assigned to latent cluster \(j\) is \[\begin{equation}
\begin{aligned}
r_{ij} & = P(X_i = j|\mathbf{D}, \mathbf{\Theta}) \\
& = \frac{S(\mathbf{G}_i; \boldsymbol{\beta}_j)
\phi(\mathbf{Z}_i; \boldsymbol{\mu}_j,
\boldsymbol{\Sigma}_j)f(Y_i;\mathbf{\Theta}_j)}{\sum_{j' = 1}^K
S(\mathbf{G}_i; \boldsymbol{\beta}_{j'}) \phi(\mathbf{Z}_i;
\boldsymbol{\mu}_{j'}, \boldsymbol{\Sigma}_{j'})f(Y_i;\mathbf{\Theta}_{j'})}
\end{aligned}
\end{equation}\] and the expectation of the complete log
likelihood can be written as \[\begin{equation}
\begin{aligned}
Q(\mathbf{\Theta}) = \sum_{i = 1}^n\sum_{j = 1}^K
r_{ij}\log S(\mathbf{G}_i; \boldsymbol{\beta}_j) + \sum_{i
= 1}^n\sum_{j = 1}^K r_{ij}\log\phi(\mathbf{Z}_i;
\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) + \sum_{i =
1}^n\sum_{j = 1}^K r_{ij}\log f(Y_i;
\boldsymbol{\Theta}_j)
\end{aligned}
\end{equation}\] At each iteration, the E-step computes
\(r_{ij}\); the M-step updates the parameters
by maximizing the expected complete-data log-likelihood
\(Q(\mathbf{\Theta})\). Details of the EM
algorithm for LUCID can be found in Peng (2019).
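Schematically, the E-step is just a row-wise normalization of the cluster-specific joint densities computed in the sketch above, and the M-step re-estimates each block of parameters using the responsibilities \(r_{ij}\) as weights. A minimal sketch (again illustrative only, not the package internals):
# E-step: given an n x K matrix "joint" whose (i, j) entry is
# S(G_i; beta_j) * phi(Z_i; mu_j, Sigma_j) * f(Y_i; Theta_j),
# the responsibilities are the row-normalized entries
e_step <- function(joint) joint / rowSums(joint)   # r_ij
# M-step (conceptually): update beta by a multinomial regression of r_ij on G,
# (mu_j, Sigma_j) by r_ij-weighted means/covariances of Z, and the outcome
# parameters by an r_ij-weighted regression of Y; iterate until convergence.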
Fit LUCID model
We use the est.lucid()
function to fit the LUCID model.
library(LUCIDus)
# use simulated data
G <- sim_data$G
Z <- sim_data$Z
Y_normal <- sim_data$Y_normal
Y_binary <- sim_data$Y_binary
cov <- sim_data$Covariate
# fit LUCID model with continuous outcome
fit1 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008)
# fit LUCID model with binary outcome
fit2 <- est.lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2, seed = 1008)
# fit LUCID model with covariates
fit3 <- est.lucid(G = G, Z = Z, Y = Y_binary, CoY = cov, family = "binary", K = 2, seed = 1008)
Users should be aware of the option useY. By default,
useY = TRUE, which means the LUCID model is fitted in a supervised
fashion, incorporating information from the outcome to define the clusters. On
the other hand, setting useY = FALSE fits the
LUCID model in an unsupervised fashion, using only information from the
exposures and omics data to define the clusters.
# fit LUCID model without using information from outcome
fit4 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, useY = FALSE, seed = 1008)
For the LUCID model, we use the mclust method to optimize the
likelihood of the omics data with respect to the mean and covariance
structure. By default, a VVV model is used to estimate the
covariance structure of the omics data, which means the volume, shape and
orientation of the covariance matrices vary across clusters.
Users can set modelName = NULL to let est.lucid
select the optimal covariance model, or specify a particular covariance
model for LUCID. Details of the available models can be found in
mclust::mclustModelNames under the multivariate mixture
section.
# fit LUCID model with automatic selection on optimal covariance models
fit5 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, modelName = NULL, seed = 1008)
# check the optimal model
fit5$modelName
# fit LUCID model with a specified covariance model
fit6 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, modelName = "EII", seed = 1008)
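If you are unsure what a covariance model code such as "VVV" or "EII" stands for, it can be looked up directly in mclust:
# look up the meaning of an mclust covariance model code
mclust::mclustModelNames("VVV")   # ellipsoidal: varying volume, shape and orientation
mclust::mclustModelNames("EII")   # spherical: equal volume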
In terms of initialization of the EM algorithm, est.lucid
uses either random initialization based on the uniform distribution or a
heuristic method based on mclust. By default, the EM algorithm is
initialized by mclust. However, if this takes a long time, we recommend
switching to random initialization by setting init_par = "random".
# initialize EM algorithm by mclust
fit7 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, init_par = "mclust" , seed = 1008)
# initialize EM algorithm via randomization
fit8 <- est.lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, init_par = "random" , seed = 1008)
To ensure replicable results, users can set a random seed via the
option seed. The LUCID model can be summarized via
summary_lucid().
# summarize lucid results
summary_lucid(fit1)
Visualization of LUCID model
We use a Sankey diagram to visualize the LUCID model. In the Sankey
diagram, each node represents a variable in the LUCID model (exposure, omics
data or outcome) and each link corresponds to an association between two
variables. The color of a link indicates the direction of the association (by
default, dark blue refers to a positive association while light blue
refers to a negative association) and the width of a link indicates the
magnitude of the association (a larger effect corresponds to a wider link).
# visualize lucid model via a Sankey diagram
plot_lucid(fit1)
# change node color
plot_lucid(fit1, G_color = "yellow")
plot_lucid(fit1, Z_color = "red")
# change link color
plot_lucid(fit1, pos_link_color = "red", neg_link_color = "green")
LUCID incorporating missing omics data
The latest version of LUCID allows missingness in the omics data. We
consider 2 missing patterns in omics data: (1) a list-wise missing pattern,
in which only a subset of observations have measured omics data, and (2) a
sporadic missing pattern, in which missingness occurs completely at random
across the omics matrix. We implement a likelihood partition for (1) and an
integrated imputation-based EM algorithm for (2).
# fit LUCID model with list-wise missing pattern in omics data
Z_miss_1 <- Z
Z_miss_1[sample(1:nrow(Z), 0.3 * nrow(Z)), ] <- NA
fit9 <- est.lucid(G = G, Z = Z_miss_1, Y = Y_normal, family = "normal", K = 2)
# fit LUCID model with sporadic missing pattern in omics data
Z_miss_2 <- Z
index <- arrayInd(sample(length(Z_miss_2), 0.3 * length(Z_miss_2)), dim(Z_miss_2))
Z_miss_2[index] <- NA
fit10 <- est.lucid(G = G, Z = Z_miss_2, Y = Y_normal, family = "normal", K = 2, seed = 1008)
For the sporadic missing pattern, est.lucid uses
mclust::imputeData to initialize the imputation. The
imputed omics data are also returned by est.lucid.
# imputed omics dataset
fit10$Z
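As a quick sanity check (not part of the package workflow), one can verify that the matrix returned in fit10$Z no longer contains missing values:
# missing values before fitting vs. in the imputed matrix returned by est.lucid
sum(is.na(Z_miss_2))   # 30% of entries were set to NA above
sum(is.na(fit10$Z))    # expected to be 0 after imputation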
Variable selection
In many situations, using all variables for the clustering analysis is
not optimal. For example, some variables may not carry any
information about the cluster structure. Including these
non-informative variables only adds "noise" to the cluster assignment and
unnecessarily increases the model complexity. For situations of moderate
or low dimensionality, variable selection can facilitate model
interpretation and identify the correct cluster structure. The
integrated variable selection for LUCID is built upon a penalized EM
algorithm. We adapt the penalty function proposed by Zhou et
al. (Electronic Journal of Statistics, 2009) \[\begin{equation}
p_{\lambda}(\Theta) = \lambda_{\mu}\sum_{j=1}^K\sum_{l=1}^m|\mu_{jl}| +
\lambda_{W}\sum_{j=1}^K \sum_{l,o}|w_{j;l,o}|
\end{equation}\] where \(W =
\Sigma^{-1}\).
# use LUCID model to conduct integrated variable selection
# select exposures
fit11 <- est.lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal",
K = 2, seed = 1008, Rho_G = 0.1)
# select omics data
fit12 <- est.lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal",
K = 2, seed = 1008, Rho_Z_Mu = 90, Rho_Z_Cov = 0.1, init_par = "random")
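To see which exposures and omics variables retain non-zero effects after penalization, the penalized fits can be summarized as before (a suggested check; the exact layout of the output depends on the package version):
# summarize the penalized fits; variables whose effects are shrunk to zero
# are effectively excluded from the corresponding part of the model
summary_lucid(fit11)
summary_lucid(fit12)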
Model selection
For clustering analysis, we usually need to specify the number of
clusters a priori. For LUCID, we use the Bayesian information criterion (BIC)
to choose the optimal number of latent clusters. The strategy is to fit the
LUCID model over a grid of \(K\) and
choose the model with the lowest BIC. This can be done via the
wrapper function lucid.
# tune lucid over a grid of K (note this function may take time to run)
tune_lucid <- lucid(G = G, Z = Z, Y = Y_normal, K = 2:5)
Check the tuning process by
> tune_lucid$tune_list
  K Rho_G Rho_Z_Mu Rho_Z_Cov      BIC
1 2     0        0         0 46917.94
2 3     0        0         0 47716.94
3 4     0        0         0 48520.86
4 5     0        0         0 49276.93
The optimal model (K = 2) is returned by lucid().
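Based on the tuning table above, the optimal number of clusters can also be extracted by hand; a minimal sketch, assuming tune_list is a data frame with columns K and BIC as printed above:
# pick the K with the lowest BIC from the tuning table
tune_lucid$tune_list$K[which.min(tune_lucid$tune_list$BIC)]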
Bootstrap inference
We use a bootstrap method to conduct inference for LUCID.
# conduct bootstrap resampling
boot1 <- boot.lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100)
# use 90% CI
boot2 <- boot.lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100, conf = 0.9)
A diagnostic plot for the bootstrap samples is drawn by
# check distribution for bootstrap replicates of the variable of interest
plot(boot1$bootstrap, 1)
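If boot1$bootstrap is a standard "boot" object (as the plot() call above suggests), percentile confidence intervals for a single parameter can also be pulled out with the boot package; a minimal sketch under that assumption:
# percentile CI for the first bootstrapped parameter (assumes a "boot" object)
library(boot)
boot.ci(boot1$bootstrap, index = 1, type = "perc")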