Evaluating diagnostic classification models
In this article, we will describe the different options for evaluating diagnostic classification models (DCMs; also known as cognitive diagnostic models [CDMs]) using measr. We start with the data to analyze, estimate our DCM, and learn how to evaluate different aspects of the model such as model fit and reliability.
To use the code in this article, you will need to install and load the measr package.
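For example, you can install the released version from CRAN and then load it:
install.packages("measr")
library(measr)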
Example Data
To demonstrate the model fit functionality of measr, we'll use the same simulated data set that was used to illustrate the model estimation functionality. This data set contains 2,000 respondents and 20 items that measure a total of 4 attributes, although no item measures more than 2 attributes. The data was generated from the loglinear cognitive diagnostic model (LCDM), a general model that subsumes many other DCM subtypes (Henson et al., 2009). We'll first fit an LCDM and a deterministic-input, noisy "and" gate (DINA) model (de la Torre & Douglas, 2004) to the data set so that we can compare their fit indices. Because the LCDM was used to generate the simulated data, we expect the estimated LCDM to perform well. The DINA model, on the other hand, places heavy constraints on the LCDM, so we expect worse performance from it. For details on model estimation, see Estimating diagnostic classification models.
library(tidyverse)
sim_data <- read_rds("data/simulated-data.rds")
lcdm <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix,
                  resp_id = "resp_id", item_id = "item_id",
                  type = "lcdm", method = "mcmc", backend = "cmdstanr",
                  iter_warmup = 1000, iter_sampling = 500,
                  chains = 4, parallel_chains = 4,
                  file = "fits/sim-lcdm")
dina <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix,
                  resp_id = "resp_id", item_id = "item_id",
                  type = "dina", method = "mcmc", backend = "cmdstanr",
                  iter_warmup = 1000, iter_sampling = 500,
                  chains = 4, parallel_chains = 4,
                  file = "fits/sim-dina")
Model Evaluation
There are three major types of evaluations that are supported by measr.
- Absolute model fit
- Relative model fit (i.e., model comparisons)
- Reliability
Each of these is discussed in turn.
Absolute Model Fit
Absolute model fit measures how well the estimated model fits the observed data. One of the more popular methods for evaluating absolute model fit for DCMs is the M2 statistic (Hansen et al., 2016; Liu et al., 2016). We can calculate the M2 statistic with the fit_m2() function.
fit_m2(lcdm)
#> # A tibble: 1 × 8
#> m2 df pval rmsea ci_lower ci_upper `90% CI` srmsr
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 121. 129 0.670 0 0 0.0091 [0, 0.0091] 0.0166
This function returns a data frame with the M2 statistic (m2) and the degrees of freedom and p-value associated with the M2 (df and pval, respectively). A p-value less than .05 would typically indicate poor model fit. As expected in this example, the estimated LCDM shows adequate fit to our data (p > .05). The fit_m2() function also returns the RMSEA and SRMSR fit statistics. The M2, RMSEA, and SRMSR are all considered limited-information fit indices. For example, the M2 is based on univariate item statistics and the bivariate relationships between items. Thus, it cannot capture aspects of model fit that depend on higher-order relationships (e.g., triplets of items).
Because we used a fully Bayesian estimation of our models, we can use the posterior distributions to provide another evaluation of model fit that can incorporate more information. This method is known as a posterior predictive model check (PPMC). The general process for PPMCs is as follows:
- For each draw of the posterior distribution, generate a synthetic data set using the parameters values for that draw.
- For each synthetic data set calculate a summary of the data. This process creates a posterior distribution of the summary. Because the synthetic data sets are generated from the parameter values in the posterior distribution, the distribution of the data summary represents what we would expect the summary to look like, if the estimated model is correct or true.
- Calculate the same data summary for the observed data set.
- Compare the summary from the observed data to the posterior distribution.
If the observed value falls within the posterior distribution of the summary, this is evidence that our estimated model is consistent with the observed data. On the other hand, discrepancies between the observed value and posteriors indicates inconsistencies.
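To make these steps concrete, here is a small illustrative sketch of the PPMC logic. This is not the internal measr implementation; all objects are simulated stand-ins. The sketch builds a posterior predictive distribution of a simple summary, the mean raw score, from hypothetical posterior draws of item response probabilities, and then compares the observed summary to that distribution.
# Illustrative PPMC sketch with simulated stand-in objects
set.seed(1213)
n_draws <- 500   # posterior draws
n_resp  <- 200   # respondents
n_items <- 10    # items
# Hypothetical posterior draws of item response probabilities (draws x items)
post_probs <- matrix(rbeta(n_draws * n_items, 5, 3), nrow = n_draws)
# Observed responses and the observed data summary (step 3)
obs <- matrix(rbinom(n_resp * n_items, 1, 0.6), nrow = n_resp)
obs_summary <- mean(rowSums(obs))
# Steps 1-2: one synthetic data set per posterior draw, summarized the same way
ppmc_summary <- apply(post_probs, 1, function(p) {
  synth <- matrix(rbinom(n_resp * n_items, 1, rep(p, each = n_resp)),
                  nrow = n_resp)
  mean(rowSums(synth))
})
# Step 4: posterior predictive p-value (proportion of draws above the observed value)
mean(ppmc_summary > obs_summary)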
The PPMCs can be used to evaluate both model- and item-level fit. At the model level, fit is evaluated by the expected raw score distribution, as described by Park et al. (2015) and Thompson (2019). At the item level, we can evaluate fit by examining the expected conditional probabilities of members of each class providing a correct response (Sinharay & Almond, 2007; Thompson, 2019), as well as the expected odds ratio between each pair of items (Park et al., 2015; Sinharay et al., 2006).
With measr, PPMCs can be calculated with fit_ppmc(). When using fit_ppmc(), we can specify which PPMCs to calculate. For example, here we estimate just the model-level raw score check by setting item_fit = NULL.
fit_ppmc(lcdm, model_fit = "raw_score", item_fit = NULL)
#> $model_fit
#> $model_fit$raw_score
#> # A tibble: 1 × 5
#> obs_chisq ppmc_mean `2.5%` `97.5%` ppp
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 24.2 23.3 11.1 39.7 0.402
fit_ppmc() returns a list, where each element is a different PPMC. Here, we specified only one PPMC, so we get back just the raw_score element. The obs_chisq is the raw score χ2 value from our observed data. We then see the mean of the posterior distribution for the χ2, quantiles of the posterior distribution, and the posterior predictive p-value (ppp). The ppp is the proportion of posterior draws that are greater than the observed value. Values close to 0 indicate poor fit, and values close to 1 may indicate overfitting. Because the LCDM was used to generate this data, it's not surprising that the ppp is well away from 0: the observed raw score distribution is consistent with what the estimated model predicts, as the model captures the data generating process.
We can also specify the posterior quantiles that are returned. For example, we can calculate the item-level odds ratios and request quantiles that will result in a 90% credible interval.
fit_ppmc(lcdm, model_fit = NULL, item_fit = "odds_ratio",
probs = c(0.05, 0.95))
#> $item_fit
#> $item_fit$odds_ratio
#> # A tibble: 190 × 7
#> item_1 item_2 obs_or ppmc_mean `5%` `95%` ppp
#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A1 A2 2.61 2.73 2.13 3.49 0.614
#> 2 A1 A3 2.04 1.97 1.56 2.49 0.346
#> 3 A1 A4 0.904 1.07 0.841 1.34 0.914
#> 4 A1 A5 1.99 2.13 1.65 2.70 0.660
#> 5 A1 A6 3.27 3.23 2.47 4.26 0.421
#> 6 A1 A7 0.966 1.05 0.826 1.31 0.762
#> 7 A1 A8 6.78 6.58 4.84 8.90 0.390
#> 8 A1 A9 10.1 9.75 7.27 13.1 0.364
#> 9 A1 A10 2.98 2.86 2.14 3.77 0.364
#> 10 A1 A11 2.98 3.08 2.42 3.89 0.579
#> # ℹ 180 more rows
We see similar output for the item-level indices: the observed odds ratio for each item pair (obs_or), the mean of the posterior distribution for each item pair (ppmc_mean), the quantiles of the posterior that we specified, and the ppp. As with the raw score distribution, the ppp here represents the proportion of posterior draws where the odds ratio is greater than the observed value.
Relative Model Fit
Relative fit measures are used to compare multiple competing models. In this case, we have estimated an LCDM and a DINA model and want to determine which model performs better. We should first note that "better" does not necessarily mean "good." Evidence that one model performs better than another is not evidence of "good" fit in an absolute sense. Only absolute model fit indices can provide evidence of adequate fit to the data. However, if multiple models show adequate absolute fit, relative fit indices can be used, along with our understanding of the constructs, to select a preferred model.
Currently, the Stan ecosystem supports two information criteria through the loo package that can be used as relative fit indices: leave-one-out (LOO) cross-validation with Pareto-smoothed importance sampling (Vehtari et al., 2017, 2022) and the widely applicable information criterion (WAIC) described by Watanabe (2010). The information criteria can be calculated using the associated functions from the loo package (i.e., loo() and waic()). Here, we calculate the LOO for both the LCDM and DINA models.
lcdm_loo <- loo(lcdm)
lcdm_loo
#>
#> Computed from 2000 by 2000 log-likelihood matrix.
#>
#> Estimate SE
#> elpd_loo -21731.8 102.7
#> p_loo 78.1 1.2
#> looic 43463.6 205.3
#> ------
#> MCSE of elpd_loo is 0.2.
#>
#> All Pareto k estimates are good (k < 0.7).
#> See help('pareto-k-diagnostic') for details.
dina_loo <- loo(dina)
dina_loo
#>
#> Computed from 2000 by 2000 log-likelihood matrix.
#>
#> Estimate SE
#> elpd_loo -23579.8 83.2
#> p_loo 831.0 18.7
#> looic 47159.7 166.4
#> ------
#> MCSE of elpd_loo is 0.6.
#>
#> All Pareto k estimates are good (k < 0.7).
#> See help('pareto-k-diagnostic') for details.
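The WAIC estimates can be computed in the same way using waic(); output is not shown here, and the comparison below continues with the LOO estimates.
# WAIC for each model (illustrative object names)
lcdm_waic <- waic(lcdm)
dina_waic <- waic(dina)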
In isolation, these estimates are not very useful, as the information criteria are meant to facilitate model comparisons. We can conduct model comparisons using loo_compare(), which works for both LOO and WAIC estimates. In the output, the model in the first row is the preferred model. In subsequent rows, the elpd_diff column reports the difference in the information criterion (in this case the LOO) between the model in that row and the preferred model, and the se_diff column is the standard error of that difference.
loo_compare(list(lcdm = lcdm_loo, dina = dina_loo))
#> elpd_diff se_diff
#> lcdm 0.0 0.0
#> dina -1848.1 49.2
loo_comp <- loo_compare(list(lcdm = lcdm_loo, dina = dina_loo))
Bengio & Grandvalet (2004) have recommended a cutoff of 2.5
standard errors for identifying the preferred model. For example,
because the absolute value of the elpd_diff
is greater than
49.2 × 2.5 = 123, we would conclude that the LCDM fits significantly
better than the DINA model. This is our expected outcome in this case,
given the data generation process and the results of the absolute model
fit analysis. If the difference were less than 2.5 standard errors, we
would conclude that the models fit equally well.
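As a quick sketch of applying this rule in code, the stored comparison object can be checked directly (the column names match the loo_compare() output shown above):
# TRUE if the difference exceeds 2.5 standard errors
abs(loo_comp[-1, "elpd_diff"]) > 2.5 * loo_comp[-1, "se_diff"]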
Reliability
We can also evaluate DCMs through their reliability. That is, it’s
important to understand the accuracy and consistency of the
classifications that are made by the model. For models estimated with
measr, estimates of reliability can be calculated using
reliability()
.
reliability(lcdm)
#> $pattern_reliability
#> p_a p_c
#> 0.7441110 0.6093565
#>
#> $map_reliability
#> $map_reliability$accuracy
#> # A tibble: 4 × 8
#> attribute acc lambda_a kappa_a youden_a tetra_a tp_a tn_a
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 att1 0.931 0.857 0.245 0.862 0.976 0.929 0.933
#> 2 att2 0.980 0.955 0.944 0.960 0.998 0.981 0.979
#> 3 att3 0.835 0.633 0.650 0.659 0.868 0.882 0.777
#> 4 att4 0.966 0.931 0.396 0.932 0.994 0.968 0.964
#>
#> $map_reliability$consistency
#> # A tibble: 4 × 10
#> attribute consist lambda_c kappa_c youden_c tetra_c tp_c tn_c gammak
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 att1 0.873 0.738 0.860 0.746 0.922 0.869 0.877 0.898
#> 2 att2 0.960 0.909 0.935 0.918 0.992 0.955 0.964 0.970
#> 3 att3 0.761 0.422 0.622 0.507 0.717 0.796 0.711 0.771
#> 4 att4 0.936 0.870 0.932 0.871 0.980 0.935 0.936 0.948
#> # ℹ 1 more variable: pc_prime <dbl>
#>
#>
#> $eap_reliability
#> # A tibble: 4 × 5
#> attribute rho_pf rho_bs rho_i rho_tb
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 att1 0.800 0.796 0.647 0.949
#> 2 att2 0.936 0.938 0.717 0.995
#> 3 att3 0.619 0.537 0.480 0.748
#> 4 att4 0.902 0.897 0.700 0.987
There are several types of reliability evidence provided in the output. For all indices reported, values can range from 0 to 1, where 1 represents perfect accuracy or consistency. The first type of reliability in the output is pattern-level reliability, as described by Cui et al. (2012). This reliability describes the accuracy (p_a) and consistency (p_c) of the classification of respondents into an overall profile of proficiency on the assessed skills. For example, in a 3-attribute assessment there are 8 possible profiles: [0,0,0], [1,0,0], [0,1,0], [0,0,1], [1,1,0], [1,0,1], [0,1,1], [1,1,1]. One option for reporting results from a DCM-based assessment is to select the profile that is most likely for each respondent. These pattern-level reliability metrics provide a measure of the accuracy and consistency of this type of classification.
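As a concrete illustration of the profile space, the 8 profiles for a hypothetical 3-attribute assessment can be enumerated directly (the attribute names here are placeholders):
# All 2^3 = 8 possible attribute profiles for a 3-attribute assessment
profiles <- expand.grid(att1 = 0:1, att2 = 0:1, att3 = 0:1)
nrow(profiles)
#> [1] 8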
On the other hand, rather than reporting results based on the overall most likely profile, we can assign proficiency for each individual attribute and build up an overall profile from the attribute-level decisions. This type of reporting may or may not result in the same profile as the overall most likely profile. Because classifications in this scenario are made at the attribute level, we need to examine the classification accuracy and consistency for each individual attribute. For models estimated with measr, this is referred to as maximum a posteriori (MAP) reliability, because classifications are based on the most likely category for each attribute for each respondent. The MAP values reported in the output are those described by Johnson & Sinharay (2018). The acc and consist variables in the map_reliability$accuracy and map_reliability$consistency tables, respectively, are the classification accuracy and consistency metrics described by Johnson & Sinharay (2018). In their paper, they also compared these metrics to other measures of agreement (e.g., Cohen's κ, Goodman and Kruskal's λ), which are also included in the output. Johnson & Sinharay (2018) also provide recommendations for threshold values for each metric that represent poor, fair, good, very good, or excellent reliability.
Finally, rather than reporting results as classifications (e.g., proficient/not proficient), results can also be reported simply as the probability that each respondent is proficient on each attribute. Thus, rather than reporting the accuracy and consistency of a classification, we should report the precision of the reported probability. This is referred to as expected a posteriori (EAP) reliability, because the probability represents the expected value of each attribute for each respondent. The EAP values reported in the output (eap_reliability) are those described by Johnson & Sinharay (2020) and Templin & Bradshaw (2013). Templin & Bradshaw (2013) describe a reliability index based on a restrictive assumption of parallel forms (rho_tb). Johnson & Sinharay (2020) describe a more generalized parallel forms reliability (rho_pf), along with additional biserial (rho_bs) and informational (rho_i) reliability indices. Because proficiency probabilities are provided at the attribute level, the EAP reliability estimates are also provided for each attribute measured by the assessment. As with the classification reliability indices, Johnson & Sinharay (2020) provide recommendations for thresholds representing poor, fair, good, very good, and excellent reliability for all four EAP reliability indices.
Storing Model Evaluations
If you have followed along with running the code in this vignette, you will have noticed that some of the model evaluations take a significant amount of computation time. This means that repeating the calculations (e.g., if you didn't assign the output to a new object, or opened a new R session) can be very time consuming. To make your analysis more efficient, measr offers several functions that add the model evaluation metrics described in this vignette directly to the model object. If you specified a file when the model was estimated, the updated model object with the model evaluation components will automatically be resaved, ensuring that you don't have to rerun computationally intensive tasks.
There are three functions for adding model evaluation components to a model object, which correspond to the three types of evaluation described in this vignette:
- add_fit(): Adds absolute model fit indices (i.e., M2, PPMCs)
- add_criterion(): Adds relative model fit indices (i.e., LOO, WAIC)
- add_reliability(): Adds reliability metrics
All three functions have several arguments in common.
- x: The model to add evaluation components to.
- save: Whether to resave the model object if file was specified when estimating the model. The default is TRUE.
- overwrite: Whether to overwrite existing evaluations. For example, if you attempt to add reliability metrics with add_reliability(), but those metrics have already been added, should the reliability metrics be recalculated and overwrite the existing metrics? The default is FALSE.
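For instance, a call like the one below (a usage sketch based only on the arguments just described) adds reliability metrics to the model and, because overwrite = TRUE, recalculates them even if they were added previously.
lcdm <- add_reliability(lcdm, overwrite = TRUE)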
Additionally, all three functions have a ... argument for passing additional arguments along to the relevant functions. For example, if we want to add PPMC absolute model fit indices, we can specify the types of model- and item-level fit indices to calculate, just as we did when using fit_ppmc().
lcdm <- add_fit(lcdm, method = "ppmc",
                model_fit = "raw_score",
                item_fit = "odds_ratio")
Once components have been added to the model, a helper function, measr_extract(), can be used to pull out relevant pieces of output. For example, we can extract the PPMC raw score results.
measr_extract(lcdm, what = "ppmc_raw_score")
#> # A tibble: 1 × 5
#> obs_chisq ppmc_mean `2.5%` `97.5%` ppp
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 24.2 23.3 11.1 39.7 0.402
In addition, we can also extract elements of the model, such as the priors that were used during estimation, or estimated parameters like the base rate of membership in each class.
measr_extract(lcdm, what = "prior")
#> # A tibble: 4 × 3
#> class coef prior_def
#> <chr> <chr> <chr>
#> 1 intercept NA normal(0, 2)
#> 2 maineffect NA lognormal(0, 1)
#> 3 interaction NA normal(0, 2)
#> 4 structural Vc dirichlet(rep_vector(1, C))
measr_extract(lcdm, what = "strc_param")
#> # A tibble: 16 × 2
#> class estimate
#> <chr> <rvar[1d]>
#> 1 [0,0,0,0] 0.059 ± 0.0068
#> 2 [1,0,0,0] 0.080 ± 0.0078
#> 3 [0,1,0,0] 0.063 ± 0.0064
#> 4 [0,0,1,0] 0.083 ± 0.0078
#> 5 [0,0,0,1] 0.052 ± 0.0087
#> 6 [1,1,0,0] 0.044 ± 0.0058
#> 7 [1,0,1,0] 0.051 ± 0.0067
#> 8 [1,0,0,1] 0.060 ± 0.0105
#> 9 [0,1,1,0] 0.083 ± 0.0080
#> 10 [0,1,0,1] 0.049 ± 0.0078
#> 11 [0,0,1,1] 0.081 ± 0.0094
#> 12 [1,1,1,0] 0.043 ± 0.0067
#> 13 [1,1,0,1] 0.043 ± 0.0092
#> 14 [1,0,1,1] 0.092 ± 0.0114
#> 15 [0,1,1,1] 0.047 ± 0.0083
#> 16 [1,1,1,1] 0.068 ± 0.0095
For a complete list of what can be extracted with measr_extract(), see ?measr_extract.