Introduction to ltmle()

Minimal Example

Consider the observed data with time ordering W -> A -> Y. W is a continuous baseline covariate that affects A and Y. A is a binary treatment variable that affects Y. Y is a binary outcome variable.

W ~ N(0, 1)

A ~ binomial with P(A = 1) = expit(W)

Y ~ binomial with P(Y = 1) = expit(W + A)

where $$expit(z) = \frac{1}{1 + e^{-z}}$$

We want to know $$E[Y_1]$$ the expected value of Y, intervening to set A to 1.

rexpit <- function(x) rbinom(n=length(x), size=1, prob=plogis(x))
n <- 10000
W <- rnorm(n)
A <- rexpit(W)
Y <- rexpit(W + A)
data <- data.frame(W, A, Y)
#>             W A Y
#> 1 -1.00576427 1 0
#> 2 -0.16776940 0 0
#> 3 -0.47677696 1 1
#> 4 -0.43573417 0 0
#> 5  0.35557205 0 1
#> 6  0.02219503 1 1

We could try to use the simple mean of Y or the mean of Y when A = 1, but these will be biased.

mean(Y)
#>  0.5954
mean(Y[A == 1])
#>  0.7717826

Now we use ltmle().

result <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 1)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
#> Estimate of time to completion: < 1 minute
result
#> Call:
#> ltmle(data = data, Anodes = "A", Ynodes = "Y", abar = 1)
#>
#> TMLE Estimate:  0.6954446

Because we’re using simulated data, we can calculate the true value of $$E[Y_1]$$

n.large <- 1e6
W <- rnorm(n.large)
A <- 1
Y <- rexpit(W + A)
mean(Y)
#>  0.696219

Censoring

Time ordering of data is W -> C -> Y. C is a censoring node. The parameter of interest is Y, intervening to set C to be uncensored. Censoring nodes are very similar to treatment nodes. The main difference is that after censoring, other data is expected to be NA. It is assumed that the parameter of interest always intervenes to set censoring nodes to uncensored, so this is not specified in abar. g and Q regressions always stratify on censoring nodes, whereas g and Q regressions can either stratify or pool over treatment nodes (by using the stratify argument.)

Censoring nodes in data should be factors with two levels - “censored” and “uncensored”. The utility function BinaryToCensoring can be used to facilitate this.

n <- 100000
W <- rnorm(n)
C <- BinaryToCensoring(is.censored = rexpit(W))
summary(C)
#>   censored uncensored
#>      49910      50090
Y <- rep(NA, n)
Y[C == "uncensored"] <- rexpit(W[C == "uncensored"])
data <- data.frame(W, C, Y)
#>             W          C  Y
#> 1  -0.2987284   censored NA
#> 2   1.8769204 uncensored  1
#> 3  -0.9733347 uncensored  0
#> 4  -1.0332095 uncensored  1
#> 5   1.1169817   censored NA
#> 6   0.7518623   censored NA
#> 7   0.5282109   censored NA
#> 8  -0.6633313 uncensored  0
#> 9  -0.7219888   censored NA
#> 10  0.3909210   censored NA
#> 11 -0.4193815   censored NA
#> 12 -0.7178721 uncensored  0
#> 13  0.9187818   censored NA
#> 14  1.7185865   censored NA
#> 15 -0.4896426   censored NA
#> 16  1.2803184   censored NA
#> 17  0.3876463   censored NA
#> 18  0.1425409   censored NA
#> 19 -0.4874995 uncensored  0
#> 20 -0.2747150   censored NA
result <- ltmle(data, Anodes = NULL, Cnodes = "C", Ynodes = "Y", abar = NULL)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W
#>
#> gform not specified, using defaults:
#> formula for C:
#> C ~ W
#>
#> Estimate of time to completion: < 1 minute
summary(result)
#> Estimator:  tmle
#> Call:
#> ltmle(data = data, Anodes = NULL, Cnodes = "C", Ynodes = "Y",
#>     abar = NULL)
#>
#>    Parameter Estimate:  0.50057
#>     Estimated Std Err:  0.0023378
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.49599, 0.50516)

The naive estimate is biased (the true value is 0.5):

mean(data$Y) #>  NA mean(data$Y, na.rm = T)
#>  0.4152126

Longitudinal data

Time ordering of data is W -> A1 -> L -> A2 -> Y. L is time-dependent covariate because it occurs after a treatment (or censoring) variable, A1. We indicate this using the Lnodes argument.

n <- 1000
W <- rnorm(n)
A1 <- rexpit(W)
L <- 0.3 * W + 0.2 * A1 + rnorm(n)
A2 <- rexpit(W + A1 + L)
Y <- rexpit(W - 0.6 * A1 + L - 0.8 * A2)
data <- data.frame(W, A1, L, A2, Y)
#>            W A1           L A2 Y
#> 1 -1.2143715  0 -0.72291649  0 0
#> 2  0.2368600  1  0.57968819  1 0
#> 3 -0.7370930  0  0.35485803  0 1
#> 4  0.3061751  1  0.20143846  1 0
#> 5 -0.7264741  0 -0.08946743  0 0
#> 6 -1.2620376  0  0.12865984  0 0

Treatment regime of interest: set A1 to 0, set A2 to 0

ltmle(data, Anodes=c("A1", "A2"), Lnodes="L", Ynodes="Y", abar=c(0, 0))
#> Qform not specified, using defaults:
#> formula for L:
#> Q.kplus1 ~ W + A1
#> formula for Y:
#> Q.kplus1 ~ W + A1 + L + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ W
#> formula for A2:
#> A2 ~ W + A1 + L
#>
#> Estimate of time to completion: < 1 minute
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Lnodes = "L", Ynodes = "Y",
#>     abar = c(0, 0))
#>
#> TMLE Estimate:  0.5235939

Variance estimation

Consider a simple point treatment problem with observed data $$W, A, Y$$. But there is a positivity problem - for small values of $$W$$, $$Prob(A = 1)$$ is very small.

The true parameter value, $$E[Y_1]$$ is approximately 0.697.

The true TMLE standard deviation (the standard deviation of the TMLE estimate if we ran it many times on many sets of data) is approximately 0.056.

The true IPTW standard deviation (the standard deviation of the IPTW estimate if we ran it many times on many sets of data) is approximately 0.059.

n <- 1000
W <- rnorm(n)
A <- rexpit(4 * W)
Y <- rexpit(W + A)
df <- data.frame(W, A, Y)

The default variance.method is "tmle" - use TMLE in order to approximate the variance of the TMLE estimator. The estimated standard deviation is close to the true TMLE standard deviation.

r1 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
print(summary(r1))
#> Estimator:  tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE)
#>
#>    Parameter Estimate:  0.67782
#>     Estimated Std Err:  0.057989
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.56417, 0.79148)

If variance.method is "ic", variance is estimated using the estimated Influence Curve. This is fast to compute, but may be significantly anticonservative in data with positivity violations.

r2 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE,
variance.method="ic")
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
#> Warning in CheckForVarianceWarning(inputs, g.ratio): Variance estimate is
#> based on influence curve only, which may be significantly anticonservative
#> because your data appears to contain positivity violations. It is recommended
#> to use variance.method='tmle' or variance.method='iptw' to obtain a more
#> robust variance estimate (but run time may be significantly longer). See
#> variance.method details in ?ltmle
print(summary(r2))
#> Estimator:  tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#>     variance.method = "ic")
#>
#>    Parameter Estimate:  0.67782
#>     Estimated Std Err:  0.037042
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.60522, 0.75042)

If variance.method is "iptw", then use IPTW in order to approximate the variance of the TMLE estimator. This is faster to compute than variance.method = "tmle" but less accurate (and slower to compute than variance.method = "ic" but more accurate).

r3 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE,
variance.method="iptw")
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
print(summary(r3))
#> Estimator:  tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#>     variance.method = "iptw")
#>
#>    Parameter Estimate:  0.67782
#>     Estimated Std Err:  0.045018
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.58959, 0.76606)

If we use the IPTW estimator, variance.method does not change the estimated standard deviation (it only affects the estimated standard deviation of the TMLE estimator).

print(summary(r1, estimator="iptw"))
#> Estimator:  iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE)
#>
#>    Parameter Estimate:  0.73313
#>     Estimated Std Err:  0.047001
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.64101, 0.82525)
print(summary(r2, estimator="iptw")) #the same - variance.method only affects TMLE
#> Estimator:  iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#>     variance.method = "ic")
#>
#>    Parameter Estimate:  0.73313
#>     Estimated Std Err:  0.047001
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.64101, 0.82525)
print(summary(r3, estimator="iptw")) #the same - variance.method only affects TMLE
#> Estimator:  iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#>     variance.method = "iptw")
#>
#>    Parameter Estimate:  0.73313
#>     Estimated Std Err:  0.047001
#>               p-value:  <2e-16
#>     95% Conf Interval: (0.64101, 0.82525)

We can see that the values of g are very small.

summary(r1$cum.g) #> V1 #> Min. :0.01000 #> 1st Qu.:0.06621 #> Median :0.53923 #> Mean :0.52195 #> 3rd Qu.:0.95837 #> Max. :1.00000 summary(r1$cum.g.unbounded)
#>        V1
#>  Min.   :0.0000019
#>  1st Qu.:0.0662084
#>  Median :0.5392299
#>  Mean   :0.5210000
#>  3rd Qu.:0.9583683
#>  Max.   :0.9999999
head(data.frame(df, g = r1$cum.g, g.unbounded = r1$cum.g.unbounded), 20)
#>              W A Y         g  g.unbounded
#> 1  -1.35648823 0 0 0.0100000 3.409752e-03
#> 2   1.26884814 1 1 0.9963765 9.963765e-01
#> 3  -0.37773620 1 0 0.1873937 1.873937e-01
#> 4   0.94972013 1 1 0.9858499 9.858499e-01
#> 5  -0.24114949 0 0 0.2932962 2.932962e-01
#> 6   1.21533567 1 1 0.9954428 9.954428e-01
#> 7   0.14571385 0 0 0.6867324 6.867324e-01
#> 8   0.40461617 1 1 0.8697448 8.697448e-01
#> 9   0.84557785 1 1 0.9780277 9.780277e-01
#> 10 -0.02310326 1 1 0.5146515 5.146515e-01
#> 11  0.37353042 1 1 0.8538312 8.538312e-01
#> 12 -2.46671406 0 1 0.0100000 2.883257e-05
#> 13 -0.42082160 0 0 0.1607860 1.607860e-01
#> 14 -0.13043009 0 1 0.4005665 4.005665e-01
#> 15 -0.26469350 0 0 0.2727496 2.727496e-01
#> 16  1.69489974 1 1 0.9994187 9.994187e-01
#> 17  0.13098866 1 1 0.6729456 6.729456e-01
#> 18  0.47752803 1 1 0.9013555 9.013555e-01
#> 19 -1.29563313 0 0 0.0100000 4.425675e-03
#> 20  1.06440181 1 1 0.9913126 9.913126e-01