%\VignetteIndexEntry{01: Analysis of follow-up data using the Lexis functions in Epi}
\SweaveOpts{results=verbatim, keep.source=TRUE, include=FALSE, eps=FALSE}
\documentclass[a4paper, dvipsnames, twoside, 12pt]{report}
\newcommand{\Title}{Analysis of follow-up data\\ using the \texttt{Lexis} functions in
\texttt{Epi}}
\newcommand{\Tit}{Follow-up with \texttt{Lexis}}
\newcommand{\Version}{Version 9}
\newcommand{\Dates}{June 2024}
\newcommand{\Where}{SDCC}
\newcommand{\Homepage}{\url{http://bendixcarstensen.com/} }
\newcommand{\Faculty}{\begin{tabular}{rl}
Bendix Carstensen
& Steno Diabetes Center Copenhagen, Herlev, Denmark\\
& {\small \& Department of Biostatistics,
University of Copenhagen} \\
& \texttt{b@bxc.dk} \\
& \url{http://BendixCarstensen.com} \\[1em]
\end{tabular}}
\input{topreport}
\renewcommand{\rwpre}{./flup}
\chapter*{Introduction}
\addcontentsline{toc}{chapter}{Introduction}
\section{Teachnicalities}
First we set some graphics parameters and load the packages needed:
<<>>=
options(width = 90,
SweaveHooks=list(fig=function()
par(mar = c(3, 3, 1, 1),
mgp = c(3, 1, 0) / 1.6,
las = 1,
lend = "butt",
bty = "n")))
library(Epi)
library(popEpi)
library(survival)
clear()
@
% must be after clear() because 'anfang' is used at the end
<>=
anfang <- Sys.time()
cat("Start time:", format(anfang, "%F, %T"), "\n")
@ %
<<>>=
installed.packages()[c("Epi", "popEpi"),
c("Version", "Built"), drop = FALSE]
@ %
\section{About this vignette}
This vignette is an introduction to the \texttt{Lexis} machinery in the
\texttt{Epi} package, intended for representation and
manipulation of follow-up data (``event history data'') from studies where
exact dates of events are known. It accommodates follow-up through
multiple states and on multiple time scales.
This vignette uses an example from the \texttt{Epi} package to
illustrate the set-up of a simple \texttt{Lexis} object (a data frame
of follow-up intervals), as well as the subdivision of follow-up
intervals needed for multistate representation and analysis of
transition rates by flexible parametric functions.
The first chapter is exclusively on manipulation of the follow-up
representation, but it points to the subsequent chapter where analysis
is based on a \texttt{Lexis} representation with very small follow-up
intervals.
Chapter 2 uses analysis of mortality rates among Danish diabetes
patients (available in the \texttt{Epi} package) currently on insulin
treatment or not, to illustrate the use of \texttt{Lexis} object in
the analysis of rates.
\section{History}
The \texttt{Lexis} machinery in the \texttt{Epi} package was first
conceived by Martyn Plummer\cite{Plummer.2011, Carstensen.2011a}, and
since its first appearance in the \texttt{Epi} package in 2008 it has
been expanded with a number of utilities. An overview of these can be
found in the last chapter of this note, \hyperlink{lexfun}{``\texttt{Lexis} functions''}.
\chapter{Representation of follow-up data in the \texttt{Epi} package}
In the \texttt{Epi}-package, follow-up data is represented by adding
some extra variables and a few attributes to a data frame. Such a data
frame is called a \texttt{Lexis} object. The tools for handling
follow-up data then use the structure of this for special plots,
tabulations and modeling.
Follow-up data basically consists of a time of entry, a time of exit
and an indication of the status at exit (normally either ``alive'' or
``dead'') for each person. Implicitly is also assumed a status
\emph{during} the follow-up (usually ``alive'').
\begin{figure}[htbp]
\centering
\setlength{\unitlength}{1pt}
\begin{picture}(210, 70)(0, 75)
%\scriptsize
\thicklines
\put( 0, 80){\makebox(0, 0)[r]{Age-scale}}
\put( 50, 80){\line(1, 0){150}}
\put( 50, 80){\line(0, 1){5}}
\put(100, 80){\line(0, 1){5}}
\put(150, 80){\line(0, 1){5}}
\put(200, 80){\line(0, 1){5}}
\put( 50, 77){\makebox(0, 0)[t]{35}}
\put(100, 77){\makebox(0, 0)[t]{40}}
\put(150, 77){\makebox(0, 0)[t]{45}}
\put(200, 77){\makebox(0, 0)[t]{50}}
\put( 0, 115){\makebox(0, 0)[r]{Follow-up}}
\put( 80, 105){\makebox(0, 0)[r]{\small Two}}
\put( 90, 105){\line(1, 0){87}}
\put( 90, 100){\line(0, 1){10}}
\put(100, 100){\line(0, 1){10}}
\put(150, 100){\line(0, 1){10}}
\put(180, 105){\circle{6}}
\put( 95, 110){\makebox(0, 0)[b]{1}}
\put(125, 110){\makebox(0, 0)[b]{5}}
\put(165, 110){\makebox(0, 0)[b]{3}}
\put( 50, 130){\makebox(0, 0)[r]{\small One}}
\put( 60, 130){\line(1, 0){70}}
\put( 60, 125){\line(0, 1){10}}
\put(100, 125){\line(0, 1){10}}
\put(130, 130){\circle*{6}}
\put( 80, 135){\makebox(0, 0)[b]{4}}
\put(115, 135){\makebox(0, 0)[b]{3}}
\end{picture}
\caption{\it Follow-up of two persons}
\label{fig:fu2}
\end{figure}
\section{Time scales}
A time scale is a variable that varies deterministically \emph{within}
each person during follow-up, \textit{e.g.}:
\begin{itemize}
\item Age
\item Calendar time
\item Time since start of treatment
\item Time since relapse
\end{itemize}
All time scales advance at the same pace, so the time followed is the
same on all time scales. Therefore, it will suffice to use only the
entry point on each of the time scales, for example:
\begin{itemize}
\item Age at entry
\item Date of entry
\item Time at treatment (\emph{at} treatment this is 0)
\item Time at relapse (\emph{at} relapse this is 0)
\end{itemize}
In the \texttt{Epi} package, follow-up in a cohort is represented in a
\texttt{Lexis} object. A \texttt{Lexis} object is a data
frame with some extra structure to representing the follow-up. For the
\texttt{DMlate} data --- follow-up of diabetes patients in Denmark
with recorded date of birth, date of diabetes, date of insulin use, date
of first oral drug use, date of exit and date of death --- we can construct a
\texttt{Lexis} object by first including follow-up from entry to exit:
<<>>=
data(DMlate)
head(DMlate)
dmL <- Lexis(entry = list(per = dodm,
age = dodm-dobth,
tfD = 0),
exit = list(per = dox),
exit.status = factor(!is.na(dodth),
labels = c("DM", "Dead")),
data = DMlate)
timeScales(dmL)
@ %
(The excluded persons are persons with date of diabetes equal to date
of death.)
The \texttt{entry} argument is a \emph{named} list with the entry
points on each of the time scales we want to use. It defines the names
of the time scales and the entry points of the follow-up of each
person. The \texttt{exit} argument gives the exit time on \emph{one}
of the time scales, so the name of the element in this list must match
one of the names of the \texttt{entry} list. This is sufficient,
because the follow-up time on all time scales is the same, in this
case $\mathtt{dox}$-$\mathtt{dodm}$.
The \texttt{exit.status} is a categorical variable (a \emph{factor})
that indicates the exit status --- in this case whether the person
(still) is in state \texttt{DM} or exits to \texttt{Dead} at the end
of follow-up. In principle we should also indicate an
\texttt{entry.status}, but the default is to assume that all persons
enter in the \emph{first} level of \texttt{exit.state}s ---
in this case \texttt{DM}, because $\mathtt{FALSE}<\mathtt{TRUE}$.
Now take a look at the result:
<<>>=
str(dmL)
head(dmL)[, 1:10]
@ %
The \texttt{Lexis} object \texttt{dmL} has a variable for each time
scale, the value of which is the entry for each person on this time
scale. The length of the follow-up time is in the variable
\texttt{lex.dur} (\texttt{dur}ation). Note that the exit status is in
the variable \texttt{lex.Xst} (e\texttt{X}it \texttt{st}ate). The
variable \texttt{lex.Cst} is the state where the follow-up takes place
(\texttt{C}urrent \texttt{st}ate), in this case \texttt{DM} (alive
with diabetes) for all persons. This implies that observations
\emph{censored} in state \texttt{Zst}, say, are characterized by
having $\mathtt{lex.Cst} = \mathtt{lex.Xst} = \mathtt{Zst}$.
There is a \texttt{summary} function for \texttt{Lexis} objects that
lists the number of transitions and records as well as the total
amount of follow-up time; it also (optionally) prints information
about the names of the variables that constitute the time scales:
<<>>=
summary(dmL, timeScales = TRUE)
@ %
It is possible to get a visualization of the follow-up along the
time scales chosen by using the \texttt{plot} method for \texttt{Lexis}
objects. \texttt{dmL} is an object of \emph{class} \texttt{Lexis}, so
using the function \texttt{plot()} on it means that \R\ will look for
the function \texttt{plot.Lexis} and use this function.
<>=
plot(dmL)
@ %
The function allows quite a bit of control over the output, and a
\texttt{points.Lexis} function allows plotting of the endpoints of
follow-up:
<>=
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0) / 1.6)
plot(dmL, 1:2, lwd = 1, col = c("blue", "red")[dmL$sex],
grid = TRUE, lty.grid = 1, col.grid = gray(0.7),
xlim = 1960 + c(0, 60), xaxs = "i",
ylim = 40 + c(0, 60), yaxs = "i", las = 1)
points(dmL, 1:2, pch = c(NA, 3)[dmL$lex.Xst],
col = "lightgray", lwd = 3, cex = 0.3)
points(dmL, 1:2, pch = c(NA, 3)[dmL$lex.Xst],
col = c("blue", "red")[dmL$sex], lwd = 1, cex = 0.3)
box(bty = 'o')
@ %
In the above code you will note that the values of the arguments
\texttt{col} and \texttt{pch} are indexed by factors, using the
convention in \R\ that the index is taken as \emph{number of the
level} of the supplied factor. Thus \texttt{c("blue",
"red")[dmL\$sex]} is \texttt{"blue"} when \texttt{sex} is \texttt{M}
(the first level of \texttt{sex}) and. \texttt{"red"} when
\texttt{sex} is \texttt{F} (the second level of \texttt{sex}).
The results of these two plotting commands are in figure
\ref{fig:Lexis-diagram}, p. \pageref{fig:Lexis-diagram}.
\begin{figure}[tb]
\centering
\includegraphics[width = 0.35\textwidth]{flup-dmL1}
\includegraphics[width = 0.63\textwidth]{flup-dmL2}
\caption{\it Lexis diagram of the \textrm{\tt DMlate} dataset; left
panel is the default version, right panel is with some bells and
whistles. The red lines are for women, blue for men, crosses
indicate deaths.}
\label{fig:Lexis-diagram}
\end{figure}
\section{Splitting the follow-up time along a time scale}
In next chapter we shall conduct statistical analysis of mortality
rates, and a prerequisite for parametric analysis of rates is that
follow-up time is subdivided in smaller intervals, where we can
reasonably assume that rates are constant.
The follow-up time in a cohort can be subdivided (``split'') along a
time scale, for example current age. This is achieved by the
\texttt{splitLexis} (note that it is \emph{not} called
\texttt{split.Lexis}). This requires that the time scale and the
breakpoints on this time scale are supplied. Try:
<<>>=
dmS1 <- splitLexis(dmL, "age", breaks = seq(0, 100, 5))
summary(dmL)
summary(dmS1)
@ %
We see that the number of persons and events and the amount of
follow-up is the same in the two data sets; only the number of records
differ --- the extra records all have \texttt{lex.Cst} = \texttt{DM} and
\texttt{lex.Xst} = \texttt{DM}.
To see how records are split for each individual, it is useful to list
the results for a few individuals (whom we selected with a view to the
illustrative usefulness):
<<>>=
wh.id <- c(9, 27, 52, 484)
subset(dmL , lex.id %in% wh.id)[, 1:10]
subset(dmS1, lex.id %in% wh.id)[, 1:10]
@ %
The resulting object, \texttt{dmS1}, is again a \texttt{Lexis} object,
and the follow-up may be split further along another time scale, for
example diabetes duration, \texttt{tfD}. Subsequently we list the
results for the chosen individuals:
<<>>=
dmS2 <- splitLexis(dmS1, "tfD", breaks = c(0, 1, 2, 5, 10, 20, 30, 40))
subset(dmS2, lex.id %in% wh.id)[, 1:10]
@ %
A more efficient (and more intuitive) way of making this double split
is to use the \texttt{splitMulti} function from the \texttt{popEpi}
package:
<<>>=
dmM <- splitMulti(dmL,
age = seq(0, 100, 5),
tfD = c(0, 1, 2, 5, 10, 20, 30, 40),
drop = FALSE)
summary(dmS2)
summary(dmM)
@ %
Note we used the argument \texttt{drop = FALSE} which will retain
follow-up also outside the window defined by the range of the
breaks. Otherwise, the default for \texttt{splitMulti} would be to drop
follow-up outside \texttt{age} [0, 100] and \texttt{tfD} [0, 40]. This
clipping behaviour is not available in \texttt{splitLexis},
nevertheless this may be exactly what we want in some situations.
The recommended way of splitting follow-up time is by
\texttt{splitMulti}, because it is faster. But you should be aware
that the result is a \texttt{data.table} object unless you set the
option \texttt{"popEpi.datatable" = FALSE}. In some circumstances
\texttt{data.table}s behaves slightly differently from
\texttt{data.frame}s. See the manual for \texttt{data.table}.
\section{Cutting follow up time at dates of intermediate events}
If we have a recording of the date of a specific event as for example
recovery or relapse, we may classify follow-up time as being before or
after this intermediate event, but it requires that follow-up records
that straddle the event be cut in two and placed in separate records,
one representing follow-up \emph{before} the intermediate event, and another
representing follow-up \emph{after} the intermediate event. This is
achieved with the function \texttt{cutLexis}, which takes three
arguments: the time point of the intermediate event, the time scale
that this point refers to, and the value of the (new) state following
the date. Optionally, we may also define a new time scale with the
argument \texttt{new.scale = }.
We are interested in the time before and after inception of insulin
use, which occurs at the date \texttt{doins}:
<<>>=
subset(dmL, lex.id %in% wh.id)
dmC <- cutLexis(data = dmL,
cut = dmL$doins,
timescale = "per",
new.state = "Ins",
new.scale = "tfI")
subset(dmC, lex.id %in% wh.id)[, 1:10]
@ %
Note that the process of cutting time is simplified by having all
types of events referred to the calendar time scale. This is a
generally applicable advice in handling follow-up data: Get all event
times as \emph{dates}, location of events and follow-up on other time
scales can then easily be derived from this.
Note that individual 52 has had his follow-up cut at 6.55 years from
diabetes diagnosis and individual 484 at 5.70 years from diabetes
diagnosis. This dataset could then be split along the time scales as
we did before with \texttt{dmL}.
The result of this can also be achieved by cutting the split dataset
\texttt{dmS2} instead of \texttt{dmL}:
<<>>=
dmS2C <- cutLexis(data = dmS2,
cut = dmS2$doins,
timescale = "per",
new.state = "Ins",
new.scale = "tfI")
subset(dmS2C, lex.id %in% wh.id)
@ % $
Thus it does not matter in which order we use \texttt{splitLexis} and
\texttt{cutLexis}. Mathematicians would say that \texttt{splitLexis}
and \texttt{cutLexis} are commutative.
Note that for \texttt{lex.id} = 484, the follow-up subsequent to the
event is classified as being in state \texttt{Ins}, but that the final
transition to state \texttt{Dead} is preserved. This is the point of
the \texttt{precursor.states=} argument, which defaults to
\texttt{transient(dmS2)}---the transient states (states that a person
can exit from). It names the states (in this case \texttt{DM}) that
will be over-written by \texttt{new.state} (in this case
\texttt{Ins}), while the state \texttt{Dead} should not be updated
even if it is after the time where the persons moves to state
\texttt{Ins}. In other words, only state \texttt{DM} is a precursor to
state \texttt{Ins}, state \texttt{Dead} is always subsequent to state
\texttt{Ins}.
Note that we defined a new time scale, \texttt{tfI}, using the
argument \texttt{new.scale = tfI}. This has a special status relative to
the other three time scales, it is defined as time since entry into a
state, namely \texttt{Ins}, this is noted in the time scale part of
the summary of \texttt{Lexis} object --- the information sits in the
attribute \texttt{time.since} of the \texttt{Lexis} object, which can
be accessed by the function \texttt{timeSince()} or through the
\texttt{summary()}:
<<>>=
summary(dmS2C, timeScales = TRUE)
@ %
Finally we can get a quick overview of the states and transitions by
using \texttt{boxes} --- \texttt{scale.R} scales transition rates to
rates per 1000 PY:
<>=
boxes(dmC, boxpos = TRUE, scale.R = 1000, show.BE = TRUE)
legendbox(70, 95)
@ %
\insfig{box1}{0.8}{States, person years, transitions and rates in the
cut dataset. The numbers \emph{in} the boxes are person-years and
the number of persons \texttt{B}eginning, resp. \texttt{E}nding
their follow-up in each state (triggered by \textrm{\tt
show.BE = TRUE}). The numbers at the arrows are the number of
transitions and transition rates per 1000 (triggered by \textrm{\tt
scale.R = 1000}).}
The explanatory box in the upper right corner was generated by \texttt{legendbox}.
\chapter{Modeling rates from \texttt{Lexis} objects}
\section{Covariates}
In the dataset \texttt{dmS2C} there are three types of covariates that
can be used to describe mortality rates:
\begin{enumerate}
\item time-dependent covariates
\item time scales
\item fixed covariates
\end{enumerate}
There is only one time-dependent covariate here, namely
\texttt{lex.Cst}, the current state of the person's follow up; it
takes the values \texttt{DM} and \texttt{Ins} according to whether the
person has ever purchased insulin at a given time of follow-up.
The time-scales are obvious candidates for explanatory variables for
the rates, notably age and time from diagnosis (duration of diabetes)
and insulin.
\subsection{Time scales as covariates}
If we want to model the effect of the time scale variables on
occurrence rates, we will for each interval use either the value of
the left endpoint in each interval or the middle. There is a function
\texttt{timeBand} which returns either of these:
<<>>=
timeBand(dmS2C, "age", "middle")[1:10]
# For nice printing and column labelling we use the data.frame() function:
data.frame(dmS2C[, c("per", "age", "tfD", "lex.dur")],
mid.age = timeBand(dmS2C, "age", "middle"),
mid.t = timeBand(dmS2C, "tfD", "middle"),
left.t = timeBand(dmS2C, "tfD", "left" ),
right.t = timeBand(dmS2C, "tfD", "right" ),
fact.t = timeBand(dmS2C, "tfD", "factor"))[1:15, ]
@ %
Note that the values of these functions are characteristics of the
intervals defined by \texttt{breaks = }, \emph{not} the midpoints nor
left or right endpoints of the \emph{actual} follow-up intervals
(which would be \texttt{tfD} and \texttt{tfD+lex.dur}, respectively).
These functions are intended for modeling time scale variables as
factors (categorical variables) in which case the coding must be
independent of the censoring and mortality pattern --- it should only
depend on the chosen grouping of the time scale. Modeling time scales as
\emph{quantitative} should not be based on these codings but directly
on the values of the time-scale variables, i.e. the left endpoints
of the intervals.
\subsection{Differences between time scales}
Apparently, the only fixed variable is \texttt{sex}, but formally the
dates of birth (\texttt{dobth}), diagnosis (\texttt{dodm}) and first
insulin purchase (\texttt{doins}) are fixed covariates too. They can
be constructed as origins of time scales referred to the calendar time
scale. Likewise, and possibly of greater interest, we can consider
these origins on the age scale, calculated as the difference between
age and another time scale.
These would then be age at birth (hardly relevant since it is the same
for all persons), age at diabetes
diagnosis and age at insulin treatment.
\subsection{Keeping the relation between time scales}
The midpoint (as well as the right interval endpoint) should be used
with caution if the variable age at diagnosis \texttt{dodm-dobth} is
modeled too; the age at diabetes is logically equal to the difference
between current age (\texttt{age}) and time since diabetes diagnosis
(\texttt{tfD}):
<<>>=
summary((dmS2$age - dmS2$tfD) - (dmS2$dodm - dmS2$dobth))
@ %
This calculation refers to the \emph{start} of each interval --- which
are in the time scale variables in the \texttt{Lexis} object. But when
using the middle of the intervals, this relationship is not preserved:
<<>>=
summary(timeBand(dmS2, "age", "middle") -
timeBand(dmS2, "tfD", "middle") -
(dmS2$dodm - dmS2$dobth))
@ %
If all three variables are to be included in a model, we must make
sure that the \emph{substantial} relationship between the variables be
maintained. One way is to recompute age at diabetes diagnosis from the two
midpoint variables, but more straightforward would be to use the left
endpoint of the intervals, that is the time scale variables in the
\texttt{Lexis} object.
If we dissolve the relationship between the variables \texttt{age},
\texttt{tfD} and age at diagnosis by grouping we may obtain
identifiability of the three separate effects, but it will be at the
price of an arbitrary allocation of a linear trend between them.
For the sake of clarity, consider current age, $a$, duration of
disease, $d$ and age at diagnosis $e$, where
\[
\text{current age} =
\text{age at diagnosis} +
\text{disease duration},
\quad \text{\ie} \quad a = e + d \quad
\Leftrightarrow \quad e + d - a = 0
\]
If we model the effect of the quantitative variables $a$, $e$ and $d$
on the log-rates by three functions $f$, $g$ and $h$:
$ \log(\lambda) = f(a)+g(d)+h(e) $
then for any $\kappa$:
\begin{align*}
\log(\lambda) & = f(a)+g(d)+h(e)+\kappa(e+d-a)\\
& =
\big(f(a)-\kappa a \big)+
\big(g(d)+\kappa d \big)+
\big(h(e)+\kappa e \big) \\
& = \tilde f(a)+ \tilde g(d)+ \tilde h(e)
\end{align*}
In practical modeling this will turn up as a singular model matrix
with one parameter aliased, corresponding to some arbitrarily chosen
value of $\kappa$ (depending on software conventions for singular
models). This phenomenon is well known from age-period-cohort models.
Thus we see that we can move any slope around between the three terms,
so if we achieve identifiability by using grouping of one of the
variables we will in reality have settled for a particular value of
$\kappa$, without known why we chose just that. The solution is to
resort to predictions which are independent of the particular
parametrization or choose a parametrization with explicit constraints.
\section{Modeling of rates}
As mentioned, the purpose of subdividing follow-up data in smaller
intervals is to be able to model effects of time scale variables as
parametric functions. When we split along a time scale we can get
intervals that are as small as we want; if they are sufficiently
small, an assumption of constant rates in each interval becomes
reasonable.
In a model that assumes a constant occurrence rate in each of the
intervals the likelihood contribution from each interval is the same
as the likelihood contribution from a Poisson variate $D$, say, with
mean $\lambda \ell$ where $\lambda$ is the rate and $\ell$ is the
interval length, and where the value of the variate $D$ is 1 or 0
according to whether an event has occurred or not. Moreover, the
likelihood contributions from all follow-up intervals from a single
person are \emph{conditionally} independent (conditional on having
survived till the start of the interval in question). This implies
that the total contribution to the likelihood from a single person is
a product of terms, and hence the same as the likelihood of a number
of independent Poisson terms, one from each interval.
Note that variables are neither Poisson distributed (\eg they can only
ever assume values 0 or 1) nor independent --- it is only the
likelihood for the follow-up data that happens to be the same as the
likelihood from independent Poisson variates. Different models can
have the same likelihood, a model cannot be inferred from the
likelihood.
Parametric modeling of the rates is obtained by using the
\emph{values} of the time scales for each interval as \emph{quantitative}
explanatory variables, using for example splines. And of course also
the values of the fixed covariates and the time-dependent variables
for each interval. Thus the model will be one where the rate is
assumed constant in each (small) interval, but where a parametric form
of the \emph{size} of the rate in each interval is imposed by the
model, using the time scale as a quantitative covariate.
\subsection{Interval length}
In the first chapter we illustrated cutting and splitting by listing
the results for a few individuals across a number of intervals. For
illustrational purposes we used 5-year age bands to avoid excessive
listings, but since the doubling time for mortality on the age scale
is only slightly larger than 5 years, the assumption about constant
rates in each interval would be pretty far fetched if we were to use 5
year intervals.
Thus, for modeling purposes we split the follow-up in 3 month
intervals. When we use intervals of 3 months length it is superfluous
to split along multiple time scales --- the precise location of
tightly spaced splits will be irrelevant from any practical point of
view. \texttt{splitLexis} and \texttt{splitMulti} will allocate the
actual split times for all of the time scale variables, so these can be
used directly in modeling.
So we split the cut dataset in 3 months intervals along the age
scale:
<<>>=
dmCs <- splitLexis(dmC, time.scale = "age", breaks = seq(0, 110, 1/4))
summary(dmCs, t = T)
@ %
We see that we now have 228,748 records and 9996 persons, so about 23
records per person. The total risk time is 54,275 years, a bit less
than 3 months on average per record as expected.
\subsection{Practicalities for splines}
In this study we want to look at how mortality depend on age
(\texttt{age}) and time since start of insulin use (\texttt{tfI}). If
we want to use splines in the description we must allocate knots for
anchoring the splines at each of the time scales, either by some
\textit{ad hoc} method or by using some sort of penalized splines as
for example by \texttt{gam}; the latter will not be treated here; it
belongs in the realm of the \texttt{mgcv} package.
Here we shall use the former approach and allocate 5 knots on each of
the time-scales. We allocate knots so that we have the events evenly
distributed between the knots. Since the insulin state starts at 0 for
all individuals we include 0 as the first knot, such that any set of natural
splines with these knots will have the value 0 at 0 on the time
scale.
<<>>=
(a.kn <- with(subset(dmCs, lex.Xst == "Dead"),
quantile(age+lex.dur, (1:5-0.5)/5)))
(i.kn <- c(0,
with(subset(dmCs, lex.Xst == "Dead" & lex.Cst == "Ins"),
quantile(tfI+lex.dur, (1:4)/5))))
@ %
In the \texttt{Epi} package there is a convenience wrapper,
\texttt{Ns}, for the \texttt{n}atural \texttt{s}pline generator
\texttt{ns}, that takes the smallest and the largest of a set of
supplied knots to be the boundary knots, so the explicit definition of
the boundary knots becomes superfluous.
Note that it is a feature of the \texttt{Ns} (via the features of
\texttt{ns}) that any generated spline function is 0 at the leftmost
knot.
\subsection{Poisson models}
A model that describes mortality rates as only a function of age would
then be:
<<>>=
ma <- glm((lex.Xst == "Dead") ~ Ns(age, knots = a.kn),
family = poisson,
offset = log(lex.dur),
data = dmCs)
summary(ma)
@ %
The offset, \texttt{log(lex.dur)} comes from the fact that the
likelihood for the follow-up data during $\ell$ time is the same as
that for independent Poisson variates with mean $\lambda \ell$, and
that the default link function for the Poisson family is the log, so
that we are using a linear model for the log-mean,
$\log(\lambda) + \log(\ell)$. But when we want a model for the
log-rate ($\log(\lambda)$), the term $\log(\ell)$ must still be
included as a covariate, but with regression coefficient fixed to 1; a
so-called \emph{offset}. This is however a technicality; it just
exploits that the likelihood of a particular Poisson model and that of
the rates model is the same.
In the \texttt{Epi} package is a \texttt{glm} family, \texttt{poisreg}
that has a more intuitive interface to the likelihood of rates, namely
where the response is a 2-column matrix of events and person-time,
respectively. This is in concert with the fact that the outcome
variable in follow-up studies is bivariate: (event, risk time).
<<>>=
Ma <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn),
family = poisreg,
data = dmCs)
summary(Ma)
@ %
Exploiting the multistate structure in the \texttt{Lexis} object
there is a convenience wrapper for \texttt{glm} with the
\texttt{poisreg} family, that just requires specification of the
transitions in terms of \texttt{from} and \texttt{to}. Although it is
called \texttt{glm.Lexis} it is \emph{not} an S3 method for
\texttt{Lexis} objects:
<<>>=
Xa <- glm.Lexis(dmCs, from = "DM", to = "Dead",
formula = ~ Ns(age, knots = a.kn))
@ %
The result is a \texttt{glm} object but with an extra attribute,
\texttt{Lexis} with the name of the data, transition(s) modeled and
model formula
<<>>=
attr(Xa, "Lexis")
@ %
There are similar wrappers for \texttt{gam} and \texttt{coxph} models,
\texttt{gam.Lexis} and \texttt{coxph.Lexis}, but these will not be
elaborated in detail here.
The \texttt{from=} and \texttt{to=} can be omitted, in which case
all possible transitions \emph{into} any of the absorbing states is
modeled:
<<>>=
xa <- glm.Lexis(dmCs, formula = ~ Ns(age, knots = a.kn))
@
We can check if the four models fitted are the same:
<<>>=
c(deviance(ma), deviance(Ma), deviance(Xa), deviance(xa))
@ %
Oops! the model \texttt{Xa} is apparently not the same as the other
three? This is because the explicit specification
\verb|from = "DM", to = "Dead"|, omits modeling contributions from the
$\mathtt{Ins}\rightarrow\mathtt{Dead}$ transition --- the output
actually said so --- see also figure \ref{fig:box1} on
p. \pageref{fig:box1}. The other three models all use both transitions
--- and assume that the two transition rates are the same, \ie that
start of insulin has no effect on mortality. We shall relax this
assumption later.
The parameters from the model do not have any direct interpretation
\textit{per se}, but we can compute the estimated mortality rates for
a range of ages using \texttt{ci.pred} with a suitably defined
prediction data frame.
Note that if we use the \texttt{poisson} family of models, we must
specify all covariates in the model, including the variable in the
offset, \texttt{lex.dur} (remember that this was a covariate with
coefficient fixed at 1). We set the latter to 1000, because we want the
mortality rates per 1000 person-years. Using the \texttt{poisreg}
family, the prediction will ignore any value of \texttt{lex.dur}
specified in the prediction data frame, the returned rates will be per
unit in which \texttt{lex.dur} is recorded.
<>=
nd <- data.frame(age = 40:85, lex.dur = 1000)
pr.0 <- ci.pred(ma, newdata = nd) # mortality per 100 PY
pr.a <- ci.pred(Ma, newdata = nd)*1000 # mortality per 100 PY
summary(pr.0/pr.a)
matshade(nd$age, pr.a, plot = TRUE,
type = "l", lty = 1,
log = "y", xlab = "Age (years)",
ylab = "DM mortality per 1000 PY")
@ %
\insfig{pr-a}{0.8}{Mortality among Danish diabetes patients by age
with 95\% CI as shaded area. We see that the rates increase linearly
on the log-scale, that is, exponentially by age.}
\section{Time dependent covariate}
A Poisson model approach to mortality by insulin status, would be to
assume that the rate-ratio between patients on insulin and not on
insulin is a fixed quantity, independent of time since start of insulin,
independent of age. This is commonly termed a proportional hazards
assumption, because the rates (hazards) in the two groups are
proportional along the age (baseline time) scale.
<<>>=
pm <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn)
+ lex.Cst + sex,
family = poisreg,
data = dmCs)
round(ci.exp(pm), 3)
@ %
Again we can simplify the code using \code{glm.Lexis}:
<<>>=
pm <- glm.Lexis(dmCs, ~ Ns(age, knots = a.kn) + lex.Cst + sex)
round(ci.exp(pm), 3)
@ %
So we see that persons on insulin have about twice the mortality of
persons not on insulin and that women have 2/3 the mortality of men.
\subsection{Time since insulin start}
If we want to test whether the excess mortality depends on the time
since start if insulin treatment, we can add a spline terms in
\texttt{tfI}. But since \texttt{tfI} is a time scale defined as time
since entry into a new state (\texttt{Ins}), the variable \texttt{tfI}
will be missing for those in the \texttt{DM} state, so before modeling
we must set the \texttt{NA}s to 0, which we do with \texttt{tsNA20}
(acronym for \texttt{t}ime\texttt{s}cale \texttt{NA}s to zero):
<<>>=
pm <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn)
+ Ns(tfI, knots = i.kn)
+ lex.Cst + sex,
family = poisreg,
data = tsNA20(dmCs))
@ %
As noted before we could do this simpler with \texttt{glm.Lexis}, even
without the \texttt{from=} and \texttt{to=} arguments, because we are
modeling all transitions \emph{into} the absorbing state
(\texttt{Dead}):
<<>>=
Pm <- glm.Lexis(tsNA20(dmCs),
form = ~ Ns(age, knots = a.kn)
+ Ns(tfI, knots = i.kn)
+ lex.Cst + sex)
c(deviance(Pm), deviance(pm))
identical(model.matrix(Pm), model.matrix(pm))
@ %
The coding of the effect of \texttt{tfI} is so that the value is 0 at
0, so the meaning of the estimate of the effect of \texttt{lex.Cst} is
the RR between persons with and without insulin, immediately after
start of insulin:
<<>>=
round(ci.exp(Pm, subset = "ex"), 3)
@ %
We see that the effect of sex is pretty much the same as before, but
the effect of \texttt{lex.Cst} is much larger, it now refers to a
different quantity, namely the RR at \texttt{tfI} = 0. If we want to see
the effect of time since insulin, it is best viewed jointly with the
effect of age:
<>=
ndI <- data.frame(expand.grid(tfI = c(NA, seq(0, 15, 0.1)),
ai = seq(40, 80, 10)),
sex = "M",
lex.Cst = "Ins")
ndI <- transform(ndI, age = ai+tfI)
head(ndI)
ndA <- data.frame(age = seq(40, 100, 0.1), tfI = 0, lex.Cst = "DM", sex = "M")
pri <- ci.pred(Pm, ndI) * 1000
pra <- ci.pred(Pm, ndA) * 1000
matshade(ndI$age, pri, plot = TRUE, las = 1,
xlab = "Age (years)", ylab = "DM mortality per 1000 PY",
log = "y", lty = 1, col = "blue")
matshade(ndA$age, pra)
@ %
\insfig{ins-time}{0.8}{Mortality rates of persons on insulin, starting
insulin at ages 40, 50, \ldots, 80 (blue), compared with persons not on
insulin (black curve). Shaded areas are 95\% CI.}
In figure \ref{fig:ins-time}, p. \pageref{fig:ins-time}, we see that
mortality is high just after insulin start, but falls by almost a
factor 3 during the first year. Also we see that there is a tendency
that mortality in a given age is smallest for those with the longest
duration of insulin use.
\section{The Cox model}
Note that in the implementation of Cox-model the \texttt{age} appears
as response variable, slightly counter-intuitive as it really is a
covariate. Hence the age part of the linear predictors is not in the
model specification:
<<>>=
cm <- coxph(Surv(age, age+lex.dur, lex.Xst == "Dead") ~
Ns(tfI, knots = i.kn) + lex.Cst + sex,
data = tsNA20(dmCs))
@ %
There is also a multistate wrapper for Cox models, requiring a
l.h.s. side for the \texttt{formula = } argument:
<<>>=
Cm <- coxph.Lexis(tsNA20(dmCs),
form = age ~ Ns(tfI, knots = i.kn) + lex.Cst + sex)
round(cbind(ci.exp(cm), ci.exp(Cm)), 4)
@ %
Note that this is really a Cox model with two time scales: the
baseline time scale \texttt{age} and the time since insulin,
\texttt{tfi}. They are modeled differently, \emph{age}
non-parametrically and \texttt{tfI} with a smooth parametric
spline. And only the spline effects is shown in the parameters.
We can compare the estimates from the Cox model with those from the
Poisson model --- we must add \texttt{NA}s because the Cox-model does
not give the parameters for the baseline time scale (\texttt{age}), but
also remove one of the parameters, because \texttt{coxph} parametrizes
factors (in this case \texttt{lex.Cst}) by all defined levels and not
only by the levels present in the dataset at hand (note the line of
\texttt{1.0000000}s in the print above):
<<>>=
round(cbind(ci.exp(Pm),
rbind(matrix(NA, 5, 3),
ci.exp(cm)[-6, ])), 3)
@ %
Thus we see that the Poisson and Cox gives pretty much the same
results. You may argue that Cox requires a smaller dataset, because
there is no need to subdivide data in small intervals \emph{before}
insulin use. But certainly the time \emph{after} insulin inception need
to be if the effect of this time should be modeled.
The drawback of the Cox-modeling is that it is not possible to show
the absolute rates as we did in figure \ref{fig:ins-time} on
\pageref{fig:ins-time}.
\section{Marginal effect of time since insulin}
When we plot the marginal effect of \texttt{tfI} from the two models
we get pretty much the same; here we plot the RR relative to
\texttt{tfI} = 2 years. Note that we are deriving the RR as the ratio
of two sets of predictions, from the data frames \texttt{nd} and
\texttt{nr}---variables assumed to have the same values in the two
data frames need not be included in the prediction frames, but
numerical variables omitted must be indicated in the \texttt{xvars=}
argument. For further details, consult the help page for
\texttt{ci.lin}, specifically the use of a list as the
\texttt{ctr.mat} argument:
<>=
nd <- data.frame(tfI = seq(0, 15, , 151), lex.Cst = "Ins", sex = "M")
nr <- data.frame(tfI = 2 , lex.Cst = "Ins", sex = "M")
# We need to put in xvars="age" because age is in the model but not
# in the prediction frames
ppr <- ci.exp(pm, list(nd, nr), xvars = "age")
cpr <- ci.exp(cm, list(nd, nr))
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(nd$tfI, cbind(ppr, cpr), plot = T,
lty = c(1, 2), log = "y",
xlab = "Time since insulin (years)", ylab = "Mortality rate ratio")
abline(h = 1, lty = 3)
@ %
\insfig{Ieff}{0.8}{The naked duration effects on mortality relative to 2 years of
duration. Black from Poisson model, red from Cox model. The two sets
of estimates are identical, and so are the standard errors, so the
two shaded areas overlap almost perfectly.}
In figure \ref{fig:Ieff}, p. \pageref{fig:Ieff}, we see that the
duration effect is exactly the same from the two modeling approaches
(Cox and Poisson).
We will also want the RR relative to the non-insulin users --- recall these
are coded 0 on the \texttt{tfI} variable:
<>=
nd <- data.frame(tfI = seq(0, 15, , 151), lex.Cst = "Ins", sex = "M")
nr <- data.frame(tfI = 0 , lex.Cst = "DM" , sex = "M")
ppr <- ci.exp(pm, list(nd, nr), xvars = "age")
cpr <- ci.exp(cm, list(nd, nr))
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(nd$tfI, cbind(ppr, cpr), lwd = 2,
xlab = "Time since insulin (years)",
ylab = "Rate ratio relative to non-Insulin",
lty = c(1, 2), log = "y", plot = T)
@ %
\insfig{IeffR}{0.8}{Insulin duration effect (state \textrm{\tt Ins})
relative to no insulin (state \textrm{\tt DM}), black from Poisson
model, red from Cox model. The \emph{shape} is the same as the
previous figure, but the RR is now relative to non-insulin, instead
of relative to insulin users at 2 years duration. The estimates from
the Cox model and the Poisson model are virtually identical, and so
are the standard errors, so the two shaded areas overlap almost
perfectly.}
In figure \ref{fig:IeffR}, p. \pageref{fig:IeffR}, we see the effect
of increasing duration of insulin use \emph{for a fixed age} which is
a bit artificial, so we would like to see the \emph{joint} effects of
age and insulin duration. What we cannot see is how the duration
affects mortality relative to \texttt{current} age (at the age
attained at the same time as the attained \texttt{tfI}).
Another way of interpreting this curve is as the rate ratio relative
to a person not on insulin, so we see that the RR (or hazard ratio, HR
as some call it) is over 5 at the start of insulin (the
\texttt{lex.Cst} estimate), and decreases to about 1.5 in the long
term.
Both figure \ref{fig:Ieff} and \ref{fig:IeffR} indicate a declining RR
by insulin duration, but only from figure \ref{fig:ins-time} it is
visible that mortality actually is \emph{in}creasing by age after some
2 years after insulin start. This point would not be available if we
had only fitted a Cox model where we did not have access to the
baseline hazard as a function of age.
\section{Age$\times$duration interaction}
The model we fitted assumes that the RR is the same regardless of the
age at start of insulin --- the hazards are multiplicative. Sometimes
this is termed the proportional hazards assumption: For \emph{any}
fixed age the HR is the same as a function of time since insulin, and
vice versa.
A more correct term would be ``main effects model'' --- there is no
interaction between age (the baseline time scale) and other
covariates. So there is really no need for the term ``proportional
hazards''; well defined and precise statistical terms for it has
existed for eons.
\subsection{Age at insulin start}
In order to check the consistency of the multiplicative assumption
across the spectrum of age at insulin inception, we can fit an
interaction model. One approach to this would be using a non-linear
effect of age at insulin inception (for convenience we use the same knots as
for age) --- note that the prediction data frames would be the same as we
used above, because we do not compute age at insulin use as a separate
variable, but rather enter it as the difference between current age
(\texttt{age}) and insulin duration (\texttt{tfI}).
At first glance we might think of doing:
<<>>=
imx <- glm.Lexis(tsNA20(dmCs),
formula = ~ Ns(age , knots = a.kn)
+ Ns( tfI, knots = i.kn)
+ Ns(age - tfI, knots = a.kn)
+ lex.Cst + sex)
@ %
But this will fit a model where rate-ratio between persons with and
without insulin at start of insulin (\texttt{tfI} = 0) will be the
same at any age, which really is not the type of interaction we
want.
We want the \texttt{age-tfI} term to be specific for the insulin
exposed so we may use one of two other approaches, that are
conceptually alike but mathematically different:
<<>>=
Im <- glm.Lexis(tsNA20(dmCs),
formula = ~ Ns( age , knots = a.kn)
+ Ns( tfI, knots = i.kn)
+ Ns((age - tfI) * (lex.Cst == "Ins"), knots = a.kn)
+ lex.Cst + sex)
im <- glm.Lexis(tsNA20(dmCs),
formula = ~ Ns(age , knots = a.kn)
+ Ns( tfI, knots = i.kn)
+ lex.Cst:Ns(age - tfI, knots = a.kn)
+ lex.Cst + sex)
ci.exp(Im)
ci.exp(im)
@ %
The first model (\texttt{Im}) has a common age-effect (\texttt{Ns(age, ...}) for
persons with and without insulin and a RR depending on insulin
duration \texttt{tfI} and age at insulin (\texttt{age-tfI}). Since the
linear effect of these two terms are in the model as well, a linear
trend in the RR by current age (\texttt{age}) is accommodated as well.
The second model (\texttt{im}) allows age-effects that differ
non-linearly between persons with and without insulin, because the
interaction term \texttt{lex.Cst:Ns(age-tfI...} for persons not on
insulin is merely an age term (since \texttt{tfI} is coded 0 for all
follow-up not on insulin).
We can compare the models fitted:
<<>>=
anova(imx, Im, im, test = 'Chisq')
@ %
so we see that the models indeed are different, and moreover that the
last model with a more elaborate interaction does not provide
substantial further improvement, by allowing non-linear RR along the
age-scale.
We can illustrate the different estimated rates from the three models
in figure \ref{fig:dur-int}, p. \pageref{fig:dur-int}:
<>=
pxi <- ci.pred(imx, ndI)
pxa <- ci.pred(imx, ndA)
pIi <- ci.pred(Im , ndI)
pIa <- ci.pred(Im , ndA)
pii <- ci.pred(im , ndI)
pia <- ci.pred(im , ndA)
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, cbind(pxi, pIi, pii)*1000, plot = T, log = "y",
xlab = "Age", ylab = "Mortality per 1000 PY",
lty = 1, lwd = 2, col = c("blue", "forestgreen", "red"), alpha = 0.1)
matshade(ndA$age, cbind(pxa, pIa, pia)*1000,
lty = 1, lwd = 2, col = c("blue", "forestgreen", "red"), alpha = 0.1)
@ %
\insfig{dur-int}{0.8}{Age at insulin as interaction between age and
duration, for persons starting insulin at ages 40, 50,\ldots and
persons not on insulin. Blue curves are from the naive interaction model
\textrm{\tt imx} with identical $\RR$ at \textrm{\tt tfI} = 0 at any
age; green curves are from the interaction model with age at
insulin, from the model \textrm{\tt Im} with only linear
differences by age, and red lines from the full interaction model
\textrm{\tt im}.}
We can also plot the RRs only from these models (figure
\ref{fig:dur-int-RR}, p. \pageref{fig:dur-int-RR}); for this we need
the reference frames, and the machinery from \texttt{ci.exp} allowing
a list of two data frames:
<>=
ndR <- transform(ndI, tfI = 0, lex.Cst = "DM")
cbind(head(ndI), head(ndR))
Rxi <- ci.exp(imx, list(ndI, ndR))
Rii <- ci.exp(im , list(ndI, ndR))
RIi <- ci.exp(Im , list(ndI, ndR))
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, cbind(Rxi, RIi, Rii), plot = T, log = "y",
xlab = "Age (years)", ylab = "Rate ratio vs, non-Insulin",
lty = 1, lwd = 2, col = c("blue", "forestgreen", "red"), alpha = 0.1)
abline(h = 1)
abline(h = ci.exp(imx, subset = "lex.Cst")[, 1], lty = "25", col = "blue")
@ %
\insfig{dur-int-RR}{0.9}{RR from three different interaction
models. The horizontal dotted line is at the estimated effect of
\textrm{\tt lex.Cst}, to illustrate that the first model (blue)
constrains this initial HR to be constant across age. The green
curves are the extended interaction model, and the red the full
one.}
\clearpage
\subsection{General interaction}
As a final illustration we may want to explore a different kind of
interaction, not defined from the duration --- here we simplify the
interaction by not using the second-last knot in the interaction terms
--- figure \ref{fig:splint}, p. \pageref{fig:splint}. Note again that
the prediction code is the same:
<>=
gm <- glm.Lexis(tsNA20(dmCs),
formula = ~ Ns(age, knots = a.kn)
+ Ns(tfI, knots = i.kn)
+ lex.Cst:Ns(age, knots = a.kn):Ns(tfI, knots = i.kn)
+ lex.Cst + sex)
pgi <- ci.pred(gm, ndI)
pga <- ci.pred(gm, ndA)
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, cbind(pgi, pii)*1000, plot = T,
lty = c("solid", "21"), lend = "butt", lwd = 2, log = "y",
xlab = "Age (years)", ylab = "Mortality rates per 1000 PY",
alpha = c(0.2, 0.1), col = c("black", "red"))
matshade(ndA$age, cbind(pga, pia)*1000,
lty = c("solid", "21"), lend = "butt", lwd = 2,
alpha = c(0.2, 0.1), col = c("black", "red"))
@ %
\insfig{splint}{0.8}{Spline-by-spline interaction between age and
duration (model \textrm{\tt gm}, black), and the interaction using a
non-linear effect of age at entry (model \textrm{\tt im}, red),
corresponding to the red curves in figure \ref{fig:dur-int}.}
This is in figure \ref{fig:splint}, p. \pageref{fig:splint}.
\subsection{Evaluating interactions}
Here we see that the interaction effect is such that in the older ages
the length of insulin use has an increasing effect on mortality.
Even though there is no statistically significant interaction between
age and time since start of insulin, it would be illustrative to show
the RR as a function of age at insulin and age at follow-up:
<>=
ndR <- transform(ndI, lex.Cst = "DM", tfI = 0)
iRR <- ci.exp(im, ctr.mat = list(ndI, ndR))
gRR <- ci.exp(gm, ctr.mat = list(ndI, ndR))
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, cbind(gRR, iRR), lty = 1, log = "y", plot = TRUE,
xlab = "Age (years)", ylab = "Rate ratio: Ins vs. non-Ins",
col = c("black", "red"), lwd = 2)
abline(h = 1)
@ % $
\insfig{RR-int}{0.8}{The effect of duration of insulin use at
different ages of follow-up (and age at insulin start). Estimates
are from the model with an interaction term using a non-linear
effect of age at insulin start (model \textrm{\tt im}, red) and
using a general spline interactions (model \textrm{\tt gm},
black). It appears that the general interaction over-models a bit.}
This is in figure \ref{fig:RR-int}, p. \pageref{fig:RR-int}.
The advantage of the parametric modeling (be that with age at insulin
or general spline interaction) is that it is straight-forward to
\emph{test} whether we have an interaction.
\section{Separate models}
In the above we insisted on making a joint model for the
\texttt{DM}$\rightarrow$\texttt{Dead} and the
\texttt{Ins}$\rightarrow$\texttt{Dead}
transitions, but with the complications demonstrated it would actually
have been more sensible to model the two transitions separately:
<<>>=
dmd <- glm.Lexis(dmCs,
from = "DM", to = "Dead",
formula = ~ Ns(age, knots = a.kn)
+ sex)
ind <- glm.Lexis(dmCs,
from = "Ins", to = "Dead",
formula = ~ Ns(age , knots = a.kn)
+ Ns( tfI, knots = i.kn)
+ Ns(age - tfI, knots = a.kn)
+ sex)
ini <- ci.pred(ind, ndI)
dmi <- ci.pred(dmd, ndI)
dma <- ci.pred(dmd, ndA)
@ %
The estimated mortality rates are shown in figure \ref{fig:sep-mort},
p. \pageref{fig:sep-mort}, using:
<>=
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, ini * 1000, plot = TRUE, log = "y",
xlab = "Age (years)", ylab = "Mortality rates per 1000 PY",
lwd = 2, col = "red")
matshade(ndA$age, dma*1000,
lwd = 2, col = "black")
@ %
The estimated RRs can now be computed exploiting the fact that the
estimates from the two models are uncorrelated, and hence qualify for
\texttt{ci.ratio} (this and the previous graph appear in figure
\ref{fig:Ins-noIns}, p. \pageref{fig:Ins-noIns})
<>=
par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n")
matshade(ndI$age, ci.ratio(ini, dmi), plot = TRUE, log = "y",
xlab = "Age (years)", ylab = "RR insulin vs. no insulin",
lwd = 2, col = "red")
abline(h = 1)
@ %
\begin{figure}[tb]
\centering
\includegraphics[width = 0.49\textwidth]{flup-sep-mort}
\includegraphics[width = 0.49\textwidth]{flup-sep-HR}
\caption{\it Left panel: Mortality rates from separate models for the
two mortality transitions; the \textrm{\tt
DM}$\rightarrow$\textrm{\tt Dead} transition modeled by age alone;
\textrm{\tt Ins}$\rightarrow$\textrm{\tt Dead} transition modeled
with spline effects of current age, time since insulin and and age
at insulin. \newline Right panel: Mortality HR of insulin vs. no insulin.}
\label{fig:Ins-noIns}
\end{figure}
\chapter{More states}
\section{Subdividing states}
It may be of interest to subdivide the states following the
intermediate event according to whether the event has occurred or
not. This will enable us to address the question of the fraction of
the patients that ever go on insulin.
This is done by the argument \texttt{split.states = TRUE}.
<<>>=
dmCs <- cutLexis(data = dmS2,
cut = dmS2$doins,
timescale = "per",
new.state = "Ins",
new.scale = "tfI",
split.states = TRUE)
summary(dmCs)
@ %
We can illustrate the numbers and the transitions (figure
\ref{fig:box4}, p. \pageref{fig:box4})
<>=
boxes(dmCs, boxpos = list(x = c(15, 15, 85, 85),
y = c(85, 15, 85, 15)),
scale.R = 1000, show.BE = TRUE)
legendbox(50, 50)
@ % $
\insfig{box4}{0.7}{Transitions between 4 states: the numbers \emph{in}
the boxes are person-years (middle), and below the number of persons
who start, respectively end their follow-up in each of the states.}
Note that it is only the mortality rates that we have been modeling,
namely the transitions \texttt{DM}$\rightarrow$\texttt{Dead}
and \texttt{Ins}$\rightarrow$\texttt{Dead(Ins)}.
If we were to model the cumulative risk of using insulin we would also
need a model for the DM$\rightarrow$Ins
transition. Subsequent to that we would then compute the probability
of being in each state conditional on suitable starting
conditions. With models where transition rates depend on several time
scales this is not a trivial task. This is treated in more detail in the
vignette on \texttt{simLexis}.
\section{Multiple intermediate events}
We may be interested in starting either insulin or OAD (oral
anti-diabetic drugs), thus giving rise to more states and more
time scales. This can be accomplished by the \texttt{mcutLexis}
function, that generalizes \texttt{cutLexis}:
<<>>=
dmM <- mcutLexis(dmL,
timescale = "per",
wh = c("doins", "dooad"),
new.states = c("Ins", "OAD"),
new.scales = c("tfI", "tfO"),
ties.resolve = TRUE)
summary(dmM, t = T)
@
We see that we now have two time scales defined as entry since into
states.
<<>>=
wh <- c(subset(dmM, lex.Cst == "Ins-OAD")$lex.id[1:2],
subset(dmM, lex.Cst == "OAD-Ins")$lex.id[1:2])
subset(dmM, lex.id %in% wh)
@ %
We can also illustrate the transitions to the different states, as in
figure \ref{fig:mbox}:
<>=
boxes(dmM, boxpos = list(x = c(15, 80, 40, 40, 85, 85),
y = c(50, 50, 90, 10, 90, 10)),
scale.R = 1000, show.BE = TRUE)
@ %
\insfig{mbox}{1.0}{Boxes for the \textrm{\tt dmM} object. The numbers
\emph{in} the boxes are person-years (middle), and below the number
of persons who start, respectively end their follow-up in each of
the states.}
We may not be interested in whether persons were prescribed insulin
before OAD or vice versa, in which case we would combine the levels
with both insulin and OAD to one, regardless of order (figure
\ref{fig:mboxr}):
<>=
summary(dmMr <- Relevel(dmM, list('OAD+Ins' = 5:6), first = FALSE))
boxes(dmMr, boxpos = list(x = c(15, 50, 15, 85, 85),
y = c(85, 50, 15, 85, 15)),
scale.R = 1000, show.BE = TRUE)
@ %
\insfig{mboxr}{1.0}{Boxes for the \textrm{\tt dmMr} object with
collapsed states. The numbers \emph{in} the boxes are person-years
(middle), and below the number of persons who start, respectively
end their follow-up in each of the states.}
Diagrams as those in figures
\ref{fig:mbox} and
\ref{fig:mboxr} gives an overview of the possible transitions,
which states it might be relevant to collapse, and which transitions
to model and how.
The actual modeling of the transition rates is straightforward;
split the data along some timescale and then use \texttt{glm.Lexis} or
\texttt{gam.Lexis}, where it is possible to select the transitions
modeled. This is also possible with the \texttt{coxph.Lexis}
function, but it requires that a single time scale be selected as the
baseline time scale, and the effect of this will not be accessible.
\chapter{\texttt{Lexis} functions}
\hypertarget{lexfun}{The \texttt{Lexis} machinery} has evolved over
time since it was first introduced in a workable version in
\texttt{Epi\_1.0.5} in August 2008.
Over the years there have been additions of tools for handling
multistate data. Here is a list of the current functions relating to
\texttt{Lexis} objects with a very brief description; it does not
replace the documentation. Unless otherwise stated, functions named
\texttt{something.Lexis} (with a ``\texttt{.}'') are S3 methods for
\texttt{Lexis} objects, so you can skip the ``\texttt{.Lexis}'' in
daily use.
\setlist{noitemsep}
\begin{description}
\item[Define]\ \\
\begin{description}
\item[\texttt{Lexis}] defines a \texttt{Lexis} object
\end{description}
\item[Cut and split]\ \\
\begin{description}
\item[\texttt{cutLexis}] cut follow-up at intermediate event
\item[\texttt{mcutLexis}] cut follow-up at multiple intermediate
events, keeping track of history
\item[\texttt{rcutLexis}] cut follow-up at intermediate, possibly
recurring, events, only recording the current state
\item[\texttt{countLexis}] cut follow-up at intermediate event time and count
the no. events so far
\item[\texttt{splitLexis}] split follow up along a time scale
\item[\texttt{splitMulti}] split follow up along a time scale --- from
the \texttt{popEpi} package, faster and has simpler syntax than
\texttt{splitLexis}
\item[\texttt{addCov.Lexis}] add clinical measurements at a given date to a
\texttt{Lexis} object
\item[\texttt{addDrug.Lexis}] add drug exposures to a \texttt{Lexis} object
\end{description}
\item[Boxes and plots]\ \\
\begin{description}
\item[\texttt{boxes.Lexis}] draw a diagram of states and transitions
\item[\texttt{legendbox}] draw a box explaining the numbers output by \texttt{boxes.Lexis}
\item[\texttt{plot.Lexis}] draw a standard Lexis diagram
\item[\texttt{points.Lexis}] add points to a Lexis diagram
\item[\texttt{lines.Lexis}] add lines to a Lexis diagram
\item[\texttt{PY.ann.Lexis}] annotate life lines in a Lexis diagram
\end{description}
\item[Summarize and query]\ \\
\begin{description}
\item[\texttt{summary.Lexis}] overview of transitions, risk time etc.
\item[\texttt{levels.Lexis}] what are the states in the \texttt{Lexis} object
\item[\texttt{nid.Lexis}] number of persons in the \texttt{Lexis}
object --- how many unique values of \texttt{lex.id} are present
\item[\texttt{entry}] entry time
\item[\texttt{exit}] exit time
\item[\texttt{status}] status at entry or exit
\item[\texttt{timeBand}] factor of time bands
\item[\texttt{timeScales}] what time scales are in the \texttt{Lexis} object
\item[\texttt{timeSince}] what time scales are defined as time since a given state
\item[\texttt{breaks}] what breaks are currently defined
\item[\texttt{absorbing}] what are the absorbing states
\item[\texttt{transient}] what are the transient states
\item[\texttt{preceding}, \texttt{before}] which states precede this
\item[\texttt{succeeding}, \texttt{after}] which states can follow this
\item[\texttt{tmat.Lexis}] transition matrix for the \texttt{Lexis} object
\end{description}
\item[Manipulate]\ \\
\begin{description}
\item[\texttt{subset.Lexis}, \texttt{[}] subset of a \texttt{Lexis} object
\item[\texttt{merge.Lexis}] merges a \texttt{Lexis} objects with a
\texttt{data.frame}
\item[\texttt{cbind.Lexis}] bind a \texttt{data.frame} to a \texttt{Lexis} object
\item[\texttt{rbind.Lexis}] put two \texttt{Lexis} objects head-to-foot
\item[\texttt{transform.Lexis}] transform and add variables
\item[\texttt{tsNA20}] turn \texttt{NA}s to 0s for time scales
\item[\texttt{factorize.Lexis}] turn \texttt{lex.Cst} and
\texttt{lex.Xst} into factors with levels equal to the actually
occurring values in both
\item[\texttt{Relevel.Lexis}] reorder and/or combine states
\item[\texttt{rm.tr}] remove transitions from a \texttt{Lexis} object
\item[\texttt{bootLexis}] bootstrap sample of \emph{persons}
(\texttt{lex.id}) in the \texttt{Lexis} object
\item[\texttt{unLexis}] remove \texttt{Lexis} attributes from a
\texttt{Lexis} object
\end{description}
\item[Simulate]\ \\
\begin{description}
\item[\texttt{simLexis}] simulate a \texttt{Lexis} object from
specified transition rate models
\item[\texttt{nState}, \texttt{pState}] count state occupancy from a
simulated \texttt{Lexis} object
\item[\texttt{plot.pState}, \texttt{lines.pState}] plot state occupancy from a
\texttt{pState} object
\end{description}
\item[Stack]\ \\
\begin{description}
\item[\texttt{stack.Lexis}] make a stacked object for simultaneous
analysis of transitions --- returns a \texttt{stacked.Lexis} object
\item[\texttt{subset.stacked.Lexis}] subsets of a \texttt{stacked.Lexis} object
\item[\texttt{transform.stacked.Lexis}] transform a \texttt{stacked.Lexis} object
\end{description}
\item[Interface to other packages]\ \\
\begin{description}
\item[\texttt{msdata.Lexis}] interface to \texttt{mstate} package
\item[\texttt{etm.Lexis}] interface to \texttt{etm} package
\item[\texttt{crr.Lexis}] interface to \texttt{cmprsk} package
\end{description}
\item[Statistical models] --- these are \emph{not} S3 methods
\begin{description}
\item[\texttt{glm.Lexis}] fit a \texttt{glm} model using the
\texttt{poisreg} family to (hopefully) time split data
\item[\texttt{gam.Lexis}] fit a \texttt{gam} model (from the
\texttt{mgcv} package) using the \texttt{poisreg} family to
(hopefully) time split data
\item[\texttt{coxph.Lexis}] fit a Cox model to follow-up in a
\texttt{Lexis} object
\end{description}
\end{description}
<>=
ende <- Sys.time()
cat(" Start time:", format(anfang, "%F, %T"),
"\n End time:", format( ende, "%F, %T"),
"\nElapsed time:", round(difftime(ende, anfang, units = "mins"), 2), "minutes\n")
@ %
\renewcommand{\bibname}{References}
\bibliographystyle{plain}
\begin{thebibliography}{1}
\bibitem{Carstensen.2011a}
B~Carstensen and M~Plummer.
\newblock Using {L}exis objects for multi-state models in {R}.
\newblock {\em Journal of Statistical Software}, 38(6):1--18, 1 2011.
\bibitem{Iacobelli.2013}
S~Iacobelli and B~Carstensen.
\newblock {Multiple time scales in multi-state models}.
\newblock {\em Stat Med}, 32(30):5315--5327, Dec 2013.
\bibitem{Plummer.2011}
M~Plummer and B~Carstensen.
\newblock Lexis: An {R} class for epidemiological studies with long-term
follow-up.
\newblock {\em Journal of Statistical Software}, 38(5):1--12, 1 2011.
\end{thebibliography}
\addcontentsline{toc}{chapter}{\bibname}
\end{document}