dexterMST: dexter for Multi-Stage Tests

Timo Bechger, Jesse Koops, Robert Zwitser, Ivailo Partchev, Gunter Maris

2021-01-06

dexterMST is a new R package acting as a companion to dexter (Maris et al. 2018) and adding facilities to manage and analyze data from multi-stage tests (MST) as they are found in educational measurement (Yan, Lewis, and Davier 2014). The package includes functions for importing and managing test data, assessing and improving the quality of data through basic test and item analysis, and fitting an IRT model; all adapted to the peculiarities of MST designs. Its main contribution is in the analysis of the data. It offers, in particular, the possibility to calibrate item parameters from MST using either Conditional Maximum Likelihood (CML) estimation (Zwitser and Maris 2015) or a Gibbs sampler for Bayesian inference (Koops, Bechger, and Maris, n.d.). Other tasks fall outside its scope: it has, for instance, no facilities for automatic test assembly.

What does it do?

MST must be historically the earliest attempt to achieve adaptivity in testing. In a traditional, non-adaptive test, the items that will be given to the examinee are completely known before testing has started, and no items are added until it is over. In adaptive testing, the items asked are, at least to some degree, contingent on the responses given, so the exact contents of the test only become known at the end. Bejar (2014) gives a nice overview of early attempts at adaptive testing in the 1950s. Other names for adaptive testing used in those days were tailored testing or response-contingent testing. Note that MST can be done without any computers at all, and that computer-assisted testing does not necessarily have to be adaptive.

When computers became ubiquitous, full-scale computerized adaptive testing (CAT) emerged as a realistic option. In CAT, the subject’s ability is typically reevaluated after each item and the next item is selected out of a pool, based on the interim ability estimate. In MST, adaptivity is not so fine-grained: items are selected for administration not separately but in bunches, usually called modules. In the first stage of an MST, all respondents take a routing test. In subsequent stages, the modules they are given depend on their success in previous modules: test takers with high scores are given more difficult modules, and those with low scores are given easier ones; see, e.g., Zenisky, Hambleton, and Luecht (2009), Hendrickson (2007), or Yan, Lewis, and Davier (2014).

To get closer to actual work with MST, it is convenient to represent the test design with a tree diagram. A very simple example is shown below:

The tree is read from top to bottom. The root represents the first stage where all examinees take the first module (the routing test). In the second stage, examinees with a score lower than or equal to 5 on the routing test take module 2, whereas examinees with a score higher than 5 on the routing test take module 3. Every path (from the routing test to the last module) corresponds to a booklet. In this MST, there are two booklets: the first one, booklet M1-M2, should be relatively easy while the other, M1-M3, is more difficult. Note that the thickness of a path indicates how many respondents took it.

Unlike CAT, MST is not all about algorithms. Most concepts, steps, procedures known from linear testing and familiar from dexter are still in place, but in a modified form: some are a bit more complicated, others quite a bit, a few become meaningless. We next review the basic workflow, which is similar to dexter with a few important additions, and we discuss some of the differences in more detail.

How do I use it?

There are not too many workflows to manage and analyze test data. In dexter, the most common procedure is, basically:

  1. Start a project
  2. Declare the scoring rules for all items
  3. Input data
  4. Examine data, assess quality with classical statistics, make necessary adjustments
  5. Estimate item parameters
  6. DIF, profile analysis …
  7. Estimate and analyze proficiencies

In dexterMST, we follow more or less the same path except that we must, between steps 2 and 3, communicate the MST structure to the program: the modules, the booklets, and the routing rules.

Create an MST project

The first step in dexterMST is to create a new project (actually, an empty database). In this example, the project is created in memory, and it will be lost when you close R. To create a permanent project, simply replace the string “:memory:” by a file name.

db = create_mst_project(":memory:")

Supply the scoring rules

Just like in dexter, the first really important step is to supply the scoring rules: an exhaustive list of all items that will appear in the test, all admissible responses, and the score that will be assigned to each response when grading the test. These must be given as a data.frame with three columns: item_id, response and item_score. The first two are strings; the scores are integers, and the lowest possible score on every item must be 0.

If you have scored data, you can simply make the column response match the item_score, as in the following example:

data.frame scoring_rules
item_id response item_score
item01 0 0
item01 1 1
item02 0 0
item02 1 1
item03 0 0
item03 1 1
item04 0 0
item04 1 1
add_scoring_rules_mst(db, scoring_rules)
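For scored dichotomous data, a table like the one above can be built with a few lines of base R. This is a minimal sketch: the four item names are taken from the rows shown, and a real test would list every item it contains.

```r
# Item ids as shown in the table above; extend the range for a full test.
items <- sprintf("item%02d", 1:4)

# One row per admissible response; responses are strings, scores are integers.
scoring_rules <- data.frame(
  item_id    = rep(items, each = 2),
  response   = rep(c("0", "1"), times = length(items)),
  item_score = rep(0:1, times = length(items)),
  stringsAsFactors = FALSE
)

head(scoring_rules)
```

Because the data are already scored, the response column simply repeats the score as text.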

Define the test design

In the simpler case of multi-booklet linear tests, dexter is able to infer the test design from the scoring rules and the test data. With MST, we have to work some more and provide information on the modules and the routing rules.

First, the modules. Create another data.frame with columns module_id, item_id and item_position:

data.frame design
module_id item_id item_position
Mod_2 item01 1
Mod_2 item02 2
Mod_2 item03 3
Mod_2 item04 4
Mod_2 item05 5
Mod_2 item06 6
Mod_2 item07 7
Mod_2 item08 8

Note that the items have been sorted by difficulty, which is why module 2 contains the first items.
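The rows shown above for Mod_2 could be generated as follows. This sketch covers only the eight rows printed; the complete design would stack the rows of all modules (for example with rbind), and the contents of the other modules are not shown here.

```r
# Rows of the design table for module Mod_2, as printed above.
design_mod2 <- data.frame(
  module_id     = "Mod_2",
  item_id       = sprintf("item%02d", 1:8),
  item_position = 1:8,
  stringsAsFactors = FALSE
)

design_mod2
```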

Routing rules specify how examinees pass from one module to the next. We have supplied a function, mst_rules, which lets you define the routing rules using a simple syntax. The following example defines two booklets (remember, booklets are paths) called “easy” and “hard”. The “easy” booklet consists of the routing test, here called Mod_1, and module Mod_2; it is given to examinees who scored between 0 and 5 on the routing test. Booklet “hard” consists of the routing test and module Mod_3, and is given to examinees who scored between 6 and 10 on the routing test. Obviously, the command language is a simple transcription of the tree diagram in the illustration above: read --+ as an arrow from left to right and [0:5] as a score range, here from zero up to and including five.

routing_rules = mst_rules(
  easy = Mod_1[0:5] --+ Mod_2, 
  hard = Mod_1[6:10] --+ Mod_3)
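To make the meaning of the two score ranges concrete, the routing decision itself can be written as a plain R function. This is only an illustration of the rule; route_booklet is a hypothetical helper, not part of dexterMST.

```r
# Hypothetical helper: which booklet does a routing-test score lead to?
route_booklet <- function(routing_score) {
  stopifnot(all(routing_score >= 0 & routing_score <= 10))
  ifelse(routing_score <= 5, "easy", "hard")
}

route_booklet(c(0, 5, 6, 10))
# "easy" "easy" "hard" "hard"
```

Both ranges are inclusive, so a score of exactly 5 still leads to the easy booklet.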

Having defined the two crucial elements of the design, the modules and the routing rules, use function create_mst_test to combine them and give the test a name, in this case ZwitserMaris:

create_mst_test(db,
                test_design = design,
                routing_rules = routing_rules,
                test_id = 'ZwitserMaris')

Currently, we support two possible types of routing, all and last, with all as the default. The difference lies in whether routing is based on the score obtained on the last module a person took, or on all previous modules. We discuss this in detail below.

Enter test data

With the test defined, you can enter data. This can be done in two ways: booklet by booklet in wide form, or all booklets at once. The former is illustrated below; the latter works if the data is in long format (also called normalized or tidy; see Wickham and Grolemund 2017).

example data in wide format (bk1 below)
person_id item01 item02 item03 item04 item05 item06
3 1 1 1 1 1
5 1 1 1 1 1
7 1 1 1 0 0
12 0 1 1 1 1

To enter the data in wide format, we call function add_booklet_mst twice, once for each booklet:

add_booklet_mst(db, bk1, test_id = 'ZwitserMaris', booklet_id = 'easy')
add_booklet_mst(db, bk2, test_id = 'ZwitserMaris', booklet_id = 'hard')
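For comparison, the long format holds one row per person-item combination. The sketch below reshapes a tiny wide table with base R; the data values are made up, and the exact columns the package expects for long-format input should be checked in its documentation.

```r
# Made-up wide data: one row per person, one column per item.
bk_wide <- data.frame(
  person_id = c(3, 5),
  item01    = c(1, 1),
  item02    = c(1, 0)
)

item_cols <- setdiff(names(bk_wide), "person_id")

# Long format: one row per (person, item) pair.
bk_long <- data.frame(
  person_id = rep(bk_wide$person_id, times = length(item_cols)),
  item_id   = rep(item_cols, each = nrow(bk_wide)),
  response  = unlist(bk_wide[item_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

bk_long
```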

Inspect and analyze the data

Before we attempt to fit IRT models to the data, it is common practice to compute and examine carefully some simpler statistics derived from Classical Test Theory (CTT). IRT and CTT have largely overlapping ideas of what constitutes a good item or a good test, derived from common substantive foundations. If, for example, we find that the scores on an item correlate negatively with the total scores on the test, this is a sign that something is seriously amiss with the item. The presence of such problematic items will decrease the value of Cronbach’s alpha, and so on.

Unfortunately, CTT statistics are all badly influenced by the score range restrictions and dependencies inherent in MST designs. Therefore, their usefulness is severely limited, except perhaps in the first module of a test. The good news is that the interaction model (Haberman 2007), which we advocated in dexter as a model-driven alternative to CTT, can be adapted to MST designs, making it possible to retrieve the item-regression curves conditional on the routing design. This is best appreciated in graphical form:

fi = fit_inter_mst(db, test_id = 'ZwitserMaris', booklet_id = 'hard')

plot(fi, item_id='item21')
plot(fi, item_id='item45')

The plots are similar to those in dexter except that some scores are ruled out due to the design of the test. The interaction model can only be fitted on one booklet at a time, but this includes the rather complicated MST booklets. A more detailed discussion of the item-total regressions may be found in dexter’s vignettes or on our blog.

Estimate the IRT model

Similar to dexter, dexterMST supports the Extended Nominal Response Model (ENORM) as the basic IRT model. To the user, this looks and feels like the Rasch model when items are dichotomous, and like the partial credit model otherwise. Fitting the model is as easy as can be:

f = fit_enorm_mst(db)

coef(f)
some item parameters fit on multi stage data
item_id item_score beta SE_beta
item01 1 -0.477 0.034
item02 1 -1.919 0.038
item03 1 -2.534 0.042
item04 1 -2.291 0.040
item05 1 -0.691 0.034
item06 1 -0.433 0.035
item07 1 -2.271 0.040
item08 1 -2.459 0.041

What happens under the hood is not simple, so we discuss it at some more length in a separate section below.
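To give the beta column an interpretation, the following sketch computes the probability of a correct response under a plain Rasch model. It assumes that, for dichotomous items, beta acts as an item difficulty on the same scale as ability; rasch_prob is a hypothetical helper, not a package function.

```r
# Rasch probability of a correct response, assuming beta is a difficulty.
rasch_prob <- function(theta, beta) plogis(theta - beta)

beta_item01 <- -0.477  # estimate for item01 from the table above

# Probability of a correct response at three ability levels.
rasch_prob(theta = c(-2, 0, 2), beta = beta_item01)
```

An examinee whose ability equals the item's beta has a probability of exactly one half of answering it correctly.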

DIF etc.

At the time of writing, dexterMST includes generalizations of the exploratory test for DIF known from dexter (Bechger and Maris 2015) and of profile analysis (Verhelst 2012). We feel that these go a bit beyond the fundamentals, so we confine ourselves to an example.

Let us add an invented item property using the add_item_properties_mst function and use profile_tables_mst to calculate the expected score on each item domain given the booklet score.

The following plot shows these expected domain-scores for each of the two booklets.

The plot shows the expected domain scores as lines and the average domain-scores found in the data as dots. For comparison, the dashed lines are the expected domain scores calculated using dexter. These are not correct because they ignore the design.

Ability estimation

dexterMST re-exports a number of dexter functions that can work with the parameters object returned from fit_enorm_mst, notably ability, ability_tables, and plausible_values. In the example below, we use maximum likelihood estimation (MLE) to produce a score transformation table:

abl = ability_tables(f, method='MLE')
abl
score transformation table (abl)
booklet_id booklet_score theta se
ZwitserMaris-easy 0 -Inf
ZwitserMaris-easy 1 -4.78 1.030
ZwitserMaris-easy 2 -4.03 0.750
ZwitserMaris-easy 3 -3.56 0.630
ZwitserMaris-easy 4 -3.21 0.562
ZwitserMaris-easy 5 -2.92 0.517
ZwitserMaris-easy 6 -2.67 0.485
ZwitserMaris-easy 7 -2.44 0.462
ZwitserMaris-easy 8 -2.24 0.445
ZwitserMaris-easy 9 -2.05 0.431

More examples are given below.
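The logic behind such a table can be sketched for a plain Rasch model: the MLE for a given score is the ability at which the expected score equals the observed score. The code below is an illustration only. It treats the eight items from the parameter table above as a hypothetical test and is not the computation performed by ability_tables.

```r
# Item parameters from the table above, used here as a hypothetical 8-item test.
beta <- c(-0.477, -1.919, -2.534, -2.291, -0.691, -0.433, -2.271, -2.459)

# MLE of ability for a sum score: solve sum_i P_i(theta) = score.
mle_theta <- function(score, beta) {
  if (score <= 0) return(-Inf)            # no finite MLE for a zero score
  if (score >= length(beta)) return(Inf)  # nor for a perfect score
  uniroot(function(theta) sum(plogis(theta - beta)) - score,
          interval = c(-15, 15), tol = 1e-10)$root
}

theta_hat <- sapply(0:8, mle_theta, beta = beta)
data.frame(score = 0:8, theta = theta_hat)
```

This also shows why a score of 0 is mapped to -Inf in the table: no finite ability makes an expected score of zero attainable.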

Subsetting: using predicates

dexter implements a flexible and general infrastructure for subsetting data via the predicate argument, which is available in many of the package functions. Predicates can use item properties, person covariates, booklet and item IDs, and other variables to filter the data that will be processed by the function.

We have tried very hard to preserve this mechanism in dexterMST. For example, the same analysis as above but without item item21 is done as follows:

f2 = fit_enorm_mst(db, item_id != 'item21')

However, because of the intricate dependencies in MST designs, subsetting is not trivial. We have provided some explanations in a separate section of this document.

This concludes our brief tour of a typical workflow with dexterMST. The rest of this document will examine in more detail CML estimation in dexterMST, how to specify designs with more than two stages, and some intricacies with the use of predicates. We conclude with a brief overview of the main differences between dexter and dexterMST.

CML estimation with MST

One should be careful not to apply dexter’s estimation routines in MST without thinking. Ordinary CML, in particular, is known to give biased results under these circumstances (Eggen and Verhelst 2011; Glas 1988). Recently, Zwitser and Maris (2015) demonstrated that CML estimation is possible, provided that one takes the design into account. Furthermore, they argued that sensible models, that is, those that admit CML, will in general fit quite well to MST data. The same theory was used to adapt dexter’s Bayesian estimation method to MST (Koops, Bechger, and Maris, n.d.).

In the three years since, the results of Zwitser and Maris (2015) have not been mentioned in any of the recent edited volumes on MST, and dexterMST is, to the best of our knowledge, the first publicly available attempt at a practical implementation. MST data are usually analyzed with marginal maximum likelihood (MML), which is available in a number of R packages, such as mirt. MML estimation makes assumptions about the shape of the ability distribution (usually a normal distribution is assumed), and it can produce unbiased estimates if these assumptions are fulfilled. CML, on the other hand, does not need any such assumptions at all, so it can be expected to perform well in a broader class of situations.

That MML gives biased results if the ability distribution is misspecified has been shown quite convincingly by Zwitser and Maris (2015). We reproduce their example here without the code but note that the data have been simulated with an ability distribution that is not normal but a mixture of two normals. Here is a density plot.
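A distribution of this kind can be simulated in a few lines. The mixture weights, means, and standard deviations below are illustrative assumptions, not the values used by Zwitser and Maris (2015).

```r
set.seed(1)
n <- 10000

# Each simulated person belongs to the lower component with probability 0.5.
from_low <- runif(n) < 0.5

# Abilities drawn from a mixture of two normals (parameters are assumptions).
theta <- ifelse(from_low,
                rnorm(n, mean = -1.5, sd = 0.6),
                rnorm(n, mean = 1.0, sd = 0.8))

# Kernel density estimate; plot(dens) would display the two modes.
dens <- density(theta)
```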

Note that a distribution that is not normal is not, in any way, ab-normal. In education, skewed or multi-modal distributions like this do occur as a result of many kinds of selection procedures. Below are the estimated item difficulty parameters plotted against the true parameters:

The results illustrate the well-known fact that both naive CML and MML estimates can be severely biased. The latter are not biased because of the MST design but because the population distribution was misspecified.

Note that the colors indicate whether the items occurred in module 1, module 2 or module 3. It will be clear that the modules are appropriately, albeit not perfectly, ordered in difficulty. Judging from the p-values, the (simulated) respondents would have been comfortable with the test.

library(dplyr)   # provides %>% and select()
library(knitr)   # provides kable()

tia_tables(get_responses_mst(db))$testStats %>%
  select(booklet_id, meanP) %>%
  kable(caption='mean item correct')
mean item correct
booklet_id meanP
hard 0.484
easy 0.524

Note how the dexter function tia_tables was used. To wit, we first got the response data using get_responses_mst and used these as input. This detour was necessary because the databases of dexter and dexterMST are not directly compatible.

In the same way, we can use dexter’s plausible_values function to show that the original distribution of person abilities can be approximated quite well by the distribution of the plausible values:

rsp_data = get_responses_mst(db)
pv = plausible_values(rsp_data, parms = f)

plot(density(pv$PV1), main='plausible value distribution', xlab='pv')

Note that the output from fit_enorm_mst can be used in dexter functions without further tricks. For further reading, we refer the reader to the dexter vignette on plausible values.

Predicates and MST

Response-contingent testing, to use the charming term from the past, introduces many intricate constraints and dependencies in the test design. As a result, not only the complicated techniques, but even such apparently trivial operations as removing items from the analysis or changing a scoring rule can become something of a minefield. Things are never as easy as in a linear test!

In CML calibration, it is essential that we know the routing rules, so to remove some items from the analysis, one needs to infer the MST design without these items. What happens internally in dexterMST is that a new MST is defined for each possible score over the items that are left out, with the routing rules that follow from the ones originally specified. Consider, for example, the design corresponding to the analysis with item21 left out.

design_plot(db, item_id!="item21")

As one can see, the tree is split to accommodate the examinees who answered this item correctly, and those who did not. While more complex predicates are allowed, it will be clear that these may involve complicated bookkeeping which takes time and might slow down the analysis.
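The bookkeeping involved can be illustrated with a simplified sketch. Suppose, hypothetically, that the removed item is a dichotomous item in a routing module with a maximum score of 10. For each possible score on the removed item, the routing range on the remaining items shifts accordingly. The helper split_rule is an assumption made for illustration and does not reproduce dexterMST's internal implementation.

```r
# Hypothetical helper: routing ranges on the remaining items of a module,
# one row per possible score on the removed item.
split_rule <- function(lo, hi, module_max, item_max = 1) {
  remaining_max <- module_max - item_max
  rows <- lapply(0:item_max, function(s) {
    data.frame(removed_item_score = s,
               lo = max(lo - s, 0),
               hi = min(hi - s, remaining_max))
  })
  do.call(rbind, rows)
}

easy <- split_rule(lo = 0, hi = 5,  module_max = 10)  # original rule [0:5]
hard <- split_rule(lo = 6, hi = 10, module_max = 10)  # original rule [6:10]

easy  # ranges 0:5 (removed item wrong) and 0:4 (removed item right)
hard  # ranges 6:9 (removed item wrong) and 5:9 (removed item right)
```

Each original path thus splits in two, which is the doubling visible in the design plot.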

Some limitations remain, unfortunately. At the time of writing, it is not possible to change the scoring rules after test administration, except for items in the last modules of the test. Predicates that remove complete modules from a test, e.g. module_id != 'Mod_1', will cause an error and should be avoided.

Beyond two stages: ‘all’ and ‘last’ routing

So far, we have considered the simplest MST design with just two stages. dexterMST can handle much more complex designs involving many stages and modules. As an example, we show a diagram corresponding to one of the MST tests used in the 2018 edition of Cito’s Adaptive Central End Test (ACET):

One ACET test

With more than two stages, two different methods of routing become possible:

Last: Use the score on the present module only.
All: Use the score on the present and the previous modules.

dexterMST fully supports both types of routing. We do require that a routing type applies to a complete test and is specified when the test is created, for example create_mst_test(..., routing='last'). In this vignette we used ‘all’, which is the default value.
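The difference between the two routing types comes down to which score is compared with the routing range. A minimal sketch, with routing_score as a hypothetical helper that is not part of dexterMST:

```r
# Hypothetical helper: the score used for routing after the modules taken so far.
routing_score <- function(module_scores, routing = c("all", "last")) {
  routing <- match.arg(routing)
  if (routing == "all") {
    sum(module_scores)                    # present and previous modules
  } else {
    module_scores[length(module_scores)]  # present module only
  }
}

# An examinee who scored 4 on the first module and 7 on the second:
routing_score(c(4, 7), "all")   # 11
routing_score(c(4, 7), "last")  # 7
```

With only two stages the two types coincide, since the first routing decision is always based on the routing test alone.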

It is worth noting that the ACET project includes both MST and linear tests. The linear tests are simply entered as a single module MST with a trivial routing rule, e.g.:

lin.test_design = data.frame(module_id='mod1', item_id=paste('item',1:30), item_position = 1:30)
lin.rules = mst_rules(lin.booklet = mod1)
create_mst_test(db, lin.test_design, lin.rules, test_id = 'linear test')

ACET is a large project, although even larger projects exist, such as the PIAAC study (OECD 2013). In total, the project database contains the responses of 97225 students to 3622 items spread over 169 tests, including six distinct MSTs. Could dexterMST analyse the data? Sure! On a 64-bit Windows laptop with a 2.9 GHz processor, calibration took about 1.5 minutes.

dexter vs. dexterMST: A summary for dexter users

dexterMST is a companion to dexter. It loads dexter automatically, and many of dexter’s functions can be used immediately, notably those for ability estimation. When that is not the case, there will be some kind of warning. The new functions relevant for MST have mst in their names. In addition, we have tried to keep the general logic and workflow as similar to dexter as possible. Thus, experienced dexter users should find dexterMST easy to understand. Some of the most important differences are listed below.

For your convenience, a function import_from_dexter is available that will import items, scoring rules, persons, test designs and responses from a dexter database into the dexterMST database.

References

Bechger, Timo M., and Gunter Maris. 2015. “A Statistical Test for Differential Item Pair Functioning.” Psychometrika 80 (2): 317–40.

Bejar, Isaac I. 2014. “Past and Future of Multistage Testing in Educational Reform.” Computerized Multistage Testing: Theory and Applications. New York: Chapman & Hall.

OECD (Organisation for Economic Co-operation and Development). 2013. “Technical Report of the Survey of Adult Skills (PIAAC).” Paris: OECD.

Eggen, T. J. H. M., and N. D. Verhelst. 2011. “Item Calibration in Incomplete Designs.” Psicologica 32: 107–32.

Glas, C. A. W. 1988. “The Rasch Model and Multistage Testing.” Journal of Educational Statistics 13 (1): 45–52.

Haberman, Shelby J. 2007. “The Interaction Model.” In Multivariate and Mixture Distribution Rasch Models: Extensions and Applications, edited by M. von Davier and C. H. Carstensen, 201–16. New York: Springer.

Hendrickson, Amy. 2007. “An NCME Instructional Module on Multistage Testing.” Educational Measurement: Issues and Practice 26 (2): 44–52.

Koops, Jesse, Timo Bechger, and Gunter Maris. n.d. “Bayesian Inference for Multistage and Other Incomplete Designs.” In Research for Practical Issues and Solutions in Computerized Multistage Testing, edited by A. A. von Davier and Y. Duanli, 201–16. London: Routledge.

Maris, Gunter, Timo Bechger, Jesse Koops, and Ivailo Partchev. 2018. Dexter: Data Management and Analysis of Tests. https://CRAN.R-project.org/package=dexter.

Verhelst, Norman D. 2012. “Profile Analysis: A Closer Look at the PISA 2000 Reading Data.” Scandinavian Journal of Educational Research 56 (3): 315–32. https://doi.org/10.1080/00313831.2011.583937.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.

Yan, Duanli, Charles Lewis, and Alina A von Davier. 2014. “Overview of Computerized Multistage Tests.” Computerized Multistage Testing: Theory and Applications. New York: Chapman & Hall.

Zenisky, April, Ronald K Hambleton, and Richard M Luecht. 2009. “Multistage Testing: Issues, Designs, and Research.” In Elements of Adaptive Testing, 355–72. Springer.

Zwitser, Robert J., and Gunter Maris. 2015. “Conditional Statistical Inference with Multistage Testing Designs.” Psychometrika 80 (1): 65–84. https://doi.org/10.1007/s11336-013-9369-6.