1 Introduction and Setup

Latent class models (LCMs) are widely used by many fields of research to identify unobserved groups or “latent classes” based on shared characteristics. In epidemiology, for example, determining the population-level cause-specific case fractions (CSCFs) among the cases can enable researchers in understanding disease patterns and trends. It is often the case that the available data for analysis are only imperfect measures of the underlying true cause of disease (e.g. a test result is dependent on both the true presence/absence of the disease-causing agent(s) and the often imperfect true positive/false positive rate of the test). The models implemented by baker and described in this vignette are extensions of the classical LCM and have been designed for such research settings where the data are individual-level multivariate binary measurements of unobserved causes of disease, collected under a case-control design. baker can also incorporate covariate information via regression modeling.

1.1 Overview

This vignette describes and illustrates the functionality of the baker R package. The package provides a suite of nested partially-latent class models (NPLCM) for multivariate binary responses that are observed under a case-control design. The baker package allows researchers to flexibly estimate population- and individual-level class distributions that may also depend on additional explanatory covariates.

For example, the Pneumonia Etiology Research for Child Health (PERCH) study ([1]) seeks to use modern measurement technology to infer the causes of pneumonia for which gold-standard evidence is unavailable. Based on case-control data, NPLCM is a latent variable model designed to infer the etiology distribution for the population of cases (also termed as “cause-specific case fractions,” CSCFs), and for an individual case given her measurements. We assume each observation is drawn from a mixture model for which each component represents one disease class. Measurement precision and covariation can be estimated using the control sample for whom the class is known.

Functions in baker implement recent methodological developments in [2], [3], and [4]. Estimation is accomplished by calling a cross-platform automatic Bayesian inference software JAGS through a wrapper R function that parses model specifications and data inputs. The baker package provides many useful features, including data ingestion, exploratory data analyses, model diagnostics, extensive plotting and visualization options, catalyzing vital communications between practitioners and domain scientists. Package features and workflows are illustrated using simulated and real data sets.

The main contribution of the package is in multiple fold.

First, baker enables case-control analyses with or without covariates in the NPLCM framework. Case-control design is vital to valid scientific inference in many large-scale clinical and biomedical applications. For example, in a largest pediatric pneumonia etiology study to date [1], panel molecular diagnostic tests targeting multiple putative disease-causing agents may be performed on biological samples that are collected from subjects with clinically-defined disease (“cases”) and subjects without the said disease (“controls”), resulting in multivariate binary data under a case-control design. The control subjects have the observed class of not having the said disease and serve as statistical control to estimate the measurement specificity (one minus false positive rate) when interpreting the error-prone test results in the cases.

Second, baker can fit models under deviations from the classical local independence (LI) assumption in latent class analyses. Different from continuous random effects approach taken in [5], the methodology implemented by baker uses a parallel factor decomposition with a stick-breaking prior to enable parsimonious approximation of potential local dependence (LD) between the multivariate binary responses given any class. This enables faster computation and data-driven learning of the empirical LD structure.

Third, baker is designed to handle multiple sets of case-control or case-only measurements of distinct degrees of measurement error; these measurements are classified into two broad types: 1) Bronze-Standard (BrS) data that are available for both cases and controls but with imperfect sensitivity or specificity; and 2) Silver-Standard (SS) data that are only available among cases, with perfect specificity but imperfect sensitivity. Finally, the baker package conducts full Bayesian inference via Markov chain Monte Carlo (MCMC) by calling a cross-platform and versatile automatic Bayesian software JAGS ([6],[7]) via a wrapper function baker::nplcm() that parses model specifications and data inputs. The main quantities of interest are the population- and individual-level mixing probabilities among the cases. In doing so, baker quantifies the uncertainty associated with these estimates and provides visualization functions to assist in interpreting these results.

1.1.1 Vignette organization

The focus of this document is on guiding a new user to utilize some useful functions in baker for simulation studies and data analyses, aided by other powerful R packages. We refer readers of this document to the accompanying main software paper (link) for more details about the software design considerations and review of model formulations. A series of methodological papers ([2], [3], and [4]) may be useful for a deeper statistical dive. The NPLCM and the baker packages has been the central analytic basis for some real scientific studies, the first of which are demonstrated in [1] and [8]. Since baker’s first appearance on Github, the authors have not been able to track other recent substantive publications that have used this package; we hope the main software paper and this vignette serve as the definitive reference for future scientific studies that find the baker package useful.

The rest of the document is organized as follows. In Section 2, we illustrate how to use simulate_nplcm() to simulate and summarize data without and with covariates. In Section 3, we demonstrate how to specify models for baker to translate into JAGS model files and take in data; models without or with covariates are illustrated. In Section 4, for each model specified in Section 3, we demonstrate how to specify MCMC options and fit the models using the main function nplcm(). In Section 5, we describe how the model specification, results, and data are stored in folders. In Section 6, we demonstrate how to conduct MCMC convergence diagnostics and posterior predictive checking. In Section 7, generic functions summary(),print(), and plot() are introduced to summarize S3 R objects of class nplcm. The document concludes by illustrating the use of baker via analysis of a real data set in Section 8.

1.2 Configure the computing environment

1.2.1 JAGS

First, the user needs to install Just Another Gibbs Sampler (JAGS), which can be downloaded from http://mcmc-jags.sourceforge.net/; version 4.2.0. JAGS can be downloaded for either Windows or MAC OSX operating systems. This is necessary since baker uses the JAGS program for posterior inference via Markov chain Monte Carlo (MCMC).

  • JAGS installation for Windows users:
  1. Go to the website https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/, select the folder Windows and download the file JAGS-4.3.0.exe.
  2. Open the .exe file and follow the instructions of the installer. Remember the folder path of your installation.
  3. After the installation, you need to add the folder path of JAGS to your Environment Variables. For example if JAGS is installed in C:/Program Files/JAGS/JAGS-4.3.0, then use the following command in the R console: Sys.setenv(JAGS_HOME="C:/Program Files/JAGS/JAGS-4.3.0")
  • JAGS installation for OSX users:
  1. Go to the website https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/, select the folder Mac OS X and download the file JAGS-4.3.0.dmg.
  2. Open the .dmg file and in the pop-up window, open the file JAGS-4.3.0.mpkg. A warning will be given about the unidentified developer.
  3. Go to system preferences and then click the Security & Privacy icon. Under the “General” tab, allow downloaded apps by clicking the “Open Anyway” button.
  4. In the pop-up installer, follow the instructions and install JAGS on your macOS.

1.2.2 R and baker

Second, in R (>=4.0.2), the user needs to install the latest version of baker (0.9.??; source files accessible via https://github.com/zhenkewu/baker) by running the following code in R:

devtools::install_github("zhenkewu/baker").

In addition, other packages may need to be installed and loaded by R to use existing functionalities from other R packages. Packages that will be used for reproducing results in this document are listed here.

rm(list=ls())
library(baker)
library(reshape2)
library(lubridate)
library(rjags)
library(ggplot2)
library(R2WinBUGS)

2 Data simulation, summary, and preparation

We can use the function simulate_nplcm() to simulate measurement data for a variety of models in the NPLCM family by supplying the ground truth values of model parameters. The package contains some pre-simulated data sets and can be loaded by e.g., data(data_nplcm_noreg), data(data_nplcm_reg_nest), each of which is documented in the package.

We will illustrate by simulating binary measurements of two levels of quality:

  1. BrS (bronze-standard) measurements named MBS1 that do not have perfect sensitivity or specificity, and
  2. SS (silver standard) measurements named MSS1 that do not have perfect sensitivity but perfect specificity.

In Section 8, we describe some real-world examples of BrS and SS measurements. We will illustrate how to use one or both sources of measurements to fit models.

The following parameter values are set in the following code chunk:

  • the number of subjects in case/control group for each site Nd = Nu = 300;
  • the number of causes L.CAUSE = 6 (named A to F);
  • measurement source is a single “slice” of BrS data named MBS1 and a single slice of SS data named MSS1;
  • all causative agents are tested and stored in MBS1 (a total of J.BrS targeted agents are measured), while only agents A and B are tested and stored in MSS1 (a total of J.SS);
  • the number of subclasses for each latent classs K = 2
set.seed(20200603)
Nd         <- 300                       # number of case subjects
Nu         <- 300                       # number of control subjects
L.CAUSE    <- 6                         # number of causes
cause_list <- c(LETTERS[1:L.CAUSE])     # causes names; here we illustrate single-agent causes
J.BrS      <- 6                         # number of Bronze-Standard (BrS) measures
J.SS       <- 2                         # number of Silver-Standard (SS) measures
K          <- 2                         # number of subclasses

2.1 Simulate data without covariates

We will set the true positive rates (TPRs), false positive rates (FPRs) and subclass weights for generating measurement data. Here we simulate both bronze-standard (BrS) and silver-standard (SS) measurements; two sets of TPRs and FPRs are needed, one for BrS and the other for SS. In particular, we can use the following code to achieve this. In particular, we use the following setting

  • Two sets of true positie rates (TPR) and false positive rates (FPRs) are provided below.
set.seed(2)
# true FPR for BrS data, one set of FPR values per subclass     
PsiBS0   <- cbind(c(0.25,0.25,0.2,0.15,0.15,0.15), c(0.2,0.2,0.25,0.1,0.1,0.1))
# true TPR for BrS data, one set of TPR values per subclass
ThetaBS0 <- cbind(c(0.95,0.9,0.9,0.9,0.9,0.9), c(0.95,0.9,0.9,0.9,0.9,0.9))
# true TPR for SS data
ThetaSS_withNA <- c(0.15,0.1,NA,NA,NA,NA)
# true FPR for SS data
PsiSS_withNA <- c(0,0,NA,NA,NA,NA)

# true subclass weights 
lambda   <- c(0.5,0.5) 
eta      <- c(0,1)

set_parameter_noreg <- list(
  cause_list      = cause_list,
  etiology        = c(0.5,0.2,0.15,0.05,0.05,0.05), 
  pathogen_BrS    = LETTERS[1:L.CAUSE],
  pathogen_SS     = LETTERS[1:L.CAUSE][!is.na(ThetaSS_withNA)],
  meas_nm         = list(MBS = c("MBS1"),MSS=c("MSS1")),
  Lambda          = lambda,           
  Eta             = t(replicate(J.BrS,eta)), 
  PsiBS           = PsiBS0,
  ThetaBS         = ThetaBS0,
  PsiSS           = PsiSS_withNA[!is.na(PsiSS_withNA)], 
  ThetaSS         = ThetaSS_withNA[!is.na(ThetaSS_withNA)],
  Nd              =     Nd, 
  Nu              =     Nu 
)

simu_out_noreg   <- simulate_nplcm(set_parameter_noreg)
data_nplcm_noreg <- simu_out_noreg$data_nplcm

This data set has also been saved in the baker package and can be called by data(data_nplcm_noreg).

We can get a quick sense about the simulated case-control data. For BrS data:

summarize_BrS(data_nplcm_noreg$Mobs$MBS[[1]],data_nplcm_noreg$Y)
## $Nd
## [1] 300
## 
## $Nu
## [1] 300
## 
## $MBS_mean
##         [,1]    [,2]    [,3]    [,4]    [,5]    [,6]
## case    0.60 0.31333 0.37000 0.16333 0.11667 0.13667
## control 0.21 0.21667 0.23667 0.12000 0.13333 0.13000
## 
## $MBS_q1
##              [,1]      [,2]      [,3]       [,4]       [,5]       [,6]
## case    0.5436223 0.2634318 0.3173072 0.12559292 0.08477613 0.10209823
## control 0.1675776 0.1736517 0.1919746 0.08764294 0.09919185 0.09629289
## 
## $MBS_q2
##              [,1]     [,2]      [,3]      [,4]      [,5]      [,6]
## case    0.6538491 0.367955 0.4259800 0.2095867 0.1582502 0.1804223
## control 0.2597553 0.266846 0.2880173 0.1619657 0.1767463 0.1730629
## 
## $MBS_nm
## [1] "A" "B" "C" "D" "E" "F"

We can also get the top five (n_pat=5) most frequently observed multivariate binary patterns among the cases (case_status=1):

get_top_pattern(data_nplcm_noreg$Mobs$MBS$MBS1,data_nplcm_noreg$Y,case_status=1,n_pat=5)
## $obs_pat
##     100000     001000     101000     110000     010000      other 
## 0.22666667 0.08666667 0.08666667 0.06666667 0.06333333 0.47000000 
## 
## $pattern_names
## [1] "100000" "001000" "101000" "110000" "010000" "other" 
## 
## $exist_other
## [1] TRUE
## 
## $N
## [1] 300

We can also plot pairwise log odds ratios for pairs of BrS measurements in the cases and the controls, respectively.

plot_logORmat(data_nplcm_noreg,pathogen_display=set_parameter_noreg$pathogen_BrS)
## == Visualizing pairwise log odds ratios for bronze-standard data set:  1 :  MBS1 . ==

For SS data, we can summarize similarly by:

summarize_SS(data_nplcm_noreg$Mobs$MSS[[1]],data_nplcm_noreg$Y)
## $count
## [1] 28  7
## 
## $NA_count
## A B 
## 0 0 
## 
## $Nd
## [1] 300
## 
## $MSS_mean
##         [,1]    [,2]
## [1,] 0.09333 0.02333
## 
## $MSS_q1
##           [,1]       [,2]
## [1,] 0.0649724 0.01037832
## 
## $MSS_q2
##           [,1]       [,2]
## [1,] 0.1319772 0.04834131
## 
## $MSS_nm
## [1] "A" "B"

We can also show data for a subset of individuals:

data_nplcm_noreg$X$patid <- paste("PAT",1:(Nd+Nu),sep="")
data_nplcm_noreg$X <- as.data.frame(data_nplcm_noreg$X)
subset_data_nplcm_by_index(data_nplcm_noreg,which(data_nplcm_noreg$X$patid%in%c("PAT12","PAT408")))
## $Mobs
## $Mobs$MBS
## $Mobs$MBS$MBS1
##     A B C D E F
## 12  1 0 0 1 0 0
## 408 1 0 0 0 1 0
## 
## 
## $Mobs$MSS
## $Mobs$MSS$MSS1
##     A B
## 12  0 0
## 408 0 0
## 
## 
## 
## $Y
## [1] 1 0
## 
## $X
##      patid
## 12   PAT12
## 408 PAT408
data_nplcm_noreg$X <- NULL # remove X which was for illustration.

2.1.1 Store information about data inputs

So far, we have simulated a data set based on NPLCM without any covariates, with \(K=2\) subclasses for each disease class among the cases, and among the controls. We have simulated two types of measurements: BrS measurements and SS measurements. In this section, we introduce some simple functions to organize the meta-data about measurement names (specimen, test technology), measurement quality and cause list under consideration. First, we create a BrS_object_1 containing pertinent information about the BrS measurements. For example, the specimen name can be saved in the specimen argument, e.g., “NP” for a nasopharyngeal specimen (in data_nplcm_noreg, we used MBS1). Another piece of information can be saved in the test argument, e.g. “PCR” (polymerase chain reaction) - here we use test="1" for illustration to represent a diagnostic test technology that would produce data using the said specimen. The quality specifies the measurement quality (e.g. BrS or SS). Although the list of causes cause_list is not strictly measurement, it is inputs to help interpret data and is crucial in linking the targeted agents and the unobserved causes for interpreting the data.

BrS_object_1 <- make_meas_object(patho    = set_parameter_noreg$pathogen_BrS, 
                                 specimen = "MBS", test = "1", quality = "BrS",  
                                 cause_list = set_parameter_noreg$cause_list)
BrS_object_1
## $quality
## [1] "BrS"
## 
## $patho
## [1] "A" "B" "C" "D" "E" "F"
## 
## $name_in_data
## [1] "A_MBS1" "B_MBS1" "C_MBS1" "D_MBS1" "E_MBS1" "F_MBS1"
## 
## $template
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    0    0
## [5,]    0    0    0    0    1    0
## [6,]    0    0    0    0    0    1
## [7,]    0    0    0    0    0    0
## 
## $specimen
## [1] "MBS"
## 
## $test
## [1] "1"
## 
## $nm_spec_test
## [1] "MBS1"

make_meas_object() creates a “template” for specific measurement types. This stores the information about a particular type of measurement in a list. This function can be useful when the raw data set is tabular with multiple measurement types in distinct blocks of columns (unlike data_nplcm_noreg which is a list). To allow multiple measurements to jointly inform latent status, we create a cause-targeted-agent template for each measurement pair of (specimen, test), e.g., nasal-pharyngeal (NP) swab specimens tested by PCR technology) to indicate how to connect measurements to latent status.

The element template is created by function make_template() (see Section 2.1.1.1). It is a mapping matrix from each of length(cause_list) causes to each of the J measured binary responses. The final row represent controls without the disease under consideration; the other rows represent pre-specified disease classes with distinct causative agents in the case population with the disease. This matrix is crucial in model fitting to determine which measurements are informative of a particular category of latent status. In the example above, the first row is representing a case whose disease was caused by agent A alone. In this simplified setting, A to F are directly measured. So among these cases, the first element 1 represents that the measurement of A will be positive with a TPR for A, which might differ from a reference level among the controls (FPR); the second element 0 means that the measurement of B will be positive with the same rate as in the controls (similarly for the third to the sixth element). Similarly for the second to the sixth row, they represent cases caused by agent B to F. In the final all-zero row, we use it to represent controls, because each of the measurements of A to F will be positive with probabilities equal to the reference levels in the controls (FPRs). At a deeper level, such template information is needed to utilize the visualization functions in baker, because the observed positive rates of an error-prone measurement depends on how the latent disease class is connected to the measurement.

These functions are built into the nplcm() and is automatically called when cause_list and the data_nplcm are supplied.

Second, we can similarly organize information about SS measurements as follows:

SS_object_1 <- make_meas_object(patho=LETTERS[1:J.SS],"MSS","1","SS",cause_list)

We can then record the data cleaning options:

clean_options <- list(BrS_objects = make_list(BrS_object_1), SS_objects  = make_list(SS_object_1))

Here, the clean_options list is to organize these data so can be quickly called and checked during data analyses.

2.1.1.1 make_template()

We provide a more general example of cause-targeted-agent mapping matrix. Here, we consider a scenario where causes assumed_causes can be singleton (A,X_B,Y_B,C), sub-types (B1,B2) of a measured agent (B), doubleton (A+C,B+C), or not among the measured agents (other).

The agents that are measured by a paritcular (specimen,test) may be different from the causes. In the following, we also named the rows of the template for easier exposition. The code below achieves this specification of causes and measured agents.

assumed_causes <- c("A","B1","B2","X_B","Y_B","C","A+C","B+C","other")
targeted_agents <- c("A","B","C","X_B","Y_B")
template <- make_template(patho = targeted_agents, cause_list = assumed_causes)
rownames(template) <- c(assumed_causes,"control")
colnames(template) <- targeted_agents
print(template)
##         A B C X_B Y_B
## A       1 0 0   0   0
## B1      0 1 0   0   0
## B2      0 1 0   0   0
## X_B     0 0 0   1   0
## Y_B     0 0 0   0   1
## C       0 0 1   0   0
## A+C     1 0 1   0   0
## B+C     0 1 1   0   0
## other   0 0 0   0   0
## control 0 0 0   0   0

A value of 1 indicates that we need to use true positive rates to describe the observed measurement on a targeted agent for cases caused by the latent status belonging to that row. A value of 0 indicates that we need to use false positive rates estimated from control population, whose templates for these measurements have all zero entries. There are many possible extensions of this framework. For example, if a measurement D exists that can inform all bacterial latent status, then bacterial latent status rows in column D will all be ones in the template. In a particular analysis with multiple pairs of (specimen, test), the meaning and the names of the columns in the template may vary by pairs of (specimen, test). Importantly, the meanings and the numbers of rows in this template remain the same regardless of which (specimen,test) was used to detect unobserved/latent causes, which represent disease classes that must be a shared construct.

2.2 Simulate data with covariates

2.2.1 With one discrete covariate

To simulate data with a discrete covariate (e.g., SITE), we will modify the parameter values
so that different strata of the covariate may have distinct vectors of cause-specific case fractions (CSCFs).
For example, this is designed to reflect the possibility that CSCF could vary by study site.

For simpler exposition, we use two levels and refer to the covariate as SITE with N.SITE = 2 levels. In addition, we let population-level etiologies (cause-specific case fractions, or CSCFs) vary by SITE, which is specified in etiology_allsites

N.SITE     <- 2                         # number of sites
N          <- N.SITE*(Nd+Nu)            # total number of subjects

# etiology for all sites; true values for each site/pathogen combo:
etiology_allsites <- list(c(0.5,0.2,0.15,0.05,0.05,0.05),
                          c(0.2,0.5,0.15,0.05,0.05,0.05))

In the following, we will not let the FPR vary by SITE, although this can be readily achieved. We also simulate both BrS and SS data to demonstrate the versatility of the function simulate_nplcm(). The model fitting may use BrS measurements, SS measurements, or both (as in the below example) as data inputs.

out_list <- lapply(1:N.SITE,function(siteID){ # each site has Nd cases, Nu controls.
  set_parameter <- list(    # list of the parameters that can be specified 
    cause_list   = cause_list,
    etiology     = etiology_allsites[[siteID]], 
    pathogen_BrS = LETTERS[1:J.BrS], 
    SS           = TRUE,    # simulate SS information 
    pathogen_SS  = LETTERS[1:2], # the pathogens for which we have the SS information
    meas_nm      = list(MBS = c("MBS1"),MSS=c("MSS1")),
    Lambda       = lambda,  # control subclass weight for BrS   
    Eta          = t(replicate(J.BrS,eta)),  # case subclass weight for BrS
    PsiBS        = PsiBS0,  # FPR for bronze-standard measures
    PsiSS        = cbind(rep(0,J.BrS),rep(0,J.BrS)), # FPR for SS measures
    ThetaBS      = ThetaBS0, # TPR for bronze-standard measures
    ThetaSS      = cbind(c(0.25,0.10,0.15,0.05,0.15,0.15), # TPRs for MSS1
                         c(0.25,0.10,0.15,0.05,0.15,0.15)),
    Nd = Nd, 
    Nu = Nu 
  )
  out     <- simulate_nplcm(set_parameter) 
  res   <- out$data_nplcm 
  res$X <- data.frame(SITE=rep(siteID,(set_parameter$Nd+set_parameter$Nu))) # covariate
  return(res)
})

data_nplcm_unordered  <- combine_data_nplcm(out_list) 

After combining the data across the two strata, we need to reorder subjects to place the case subjects on top of the control subjects, as required by the model fitting function nplcm(). The subset_data_nplcm_by_index() function makes this easy to do.

data_nplcm_reg_nest_strat <- subset_data_nplcm_by_index(data_nplcm_unordered,
                                                        order(-data_nplcm_unordered$Y))

2.2.2 With mixed discrete and continuous covariates

We can also use the baker package to generate data under NPLCM with regression. We need to specify the CSCF and subclass weight generating processes for simulating this data. We load the simulated data as follows:

data(data_nplcm_reg_nest)

The R code that generated this data can be found in /inst/example/simulate_data_example_nplcm_reg_nest.R of the package file. We omit the details here for simplicity.

3 Model specification

Various NPLCM can be specified by organizing model likelihood, prior, and input data qualities into model_options, which includes the following three aspects:

  1. use_measurements:
    • can be BrS or SS or c("BrS","SS") to represent the quality of the data sources
  2. likelihood:
    • k_subclass indicates whether or not to use the conditional independence model (k=1 corresponds to the model with conditional independence given a cause, referred to as “non-nested model”)
    • Regression formulas
      • Eti_formula: regression formula for relating the CSCFs to case covariates
      • FPR_formula: regression formula for relating the subclass weights to covariates (must be common to case and control subjects)
  3. prior:
    • Eti_prior:
      • if no regression (nested K>1 or non-nested K=1): specify a numeric vector of length equal to the number of causes; this vector are Dirichlet hyperparameter for the population etiologies
      • if doing regression:
      1. if non-nested K=1
        • all covariates are discrete: specify a numeric matrix with (#rows, #columns) = (#strata for etiology, #causes). Etiology prior for each stratum; each row is a vector of Dirichlet hyperparameters for the population etiologies in a stratum defined by the discrete covariates
        • if having continuous covariates: specify Eti_prior = c(2,2) for regression coefficients for continuous and non-continuous covariates
      2. if nested K>1
        • specify Eti_prior = c(2,2) for regression coefficients for continuous and non-continuous covariates
    • TPR_prior: informative priors for the true positive rates; for example, we can specify a range 0.5 - 0.99 for MBS1 measurement and range 0.1 - 0.3 for MSS1 measurement when using the simulated data so far;
    • Other priors that have been set to defaults include the smoothness selection hyperparamters for the case and control FPRs (usually taken to be non-informative, given that FPRs can be estimated from the data).

We illustrate the specification of NPLCM via a few examples.

3.1 Model options for NPLCM without regression

First, we specify model options for fitting an NPLCM without regression and using only BrS data. In this example, we are setting the TPR prior to be informative as a plausible range of values.

model_options_no_reg <- list(
  use_measurements = c("BrS"),  
  likelihood   = list(
    cause_list = cause_list,
    k_subclass = 2,
    Eti_formula = ~-1, # no etiology regression
    FPR_formula = list(
      MBS1 =   ~-1)    # no the subclass weight regression
  ),
  prior= list(
    Eti_prior = overall_uniform(1,cause_list), 
    TPR_prior  = list(BrS = list(
      info  = "informative", 
      # informative prior for TPRs
      input = "match_range", 
      # specify the informative prior for TPRs by specifying a plausible range.
      val = list(MBS1 = list(up =  list(rep(0.99,J.BrS)), 
                             # upper ranges: matched to 97.5% quantile of a Beta prior
                             low = list(rep(0.55,J.BrS))))
      # lower ranges: matched to 2.5% quantile of a Beta prior
    ))))     

assign_model(model_options_no_reg,data_nplcm_noreg)
## $num_slice
## MBS MSS MGS 
##   1   0   0 
## 
## $nested
## [1] TRUE
## 
## $regression
## $regression$do_reg_Eti
## [1] FALSE
## 
## $regression$do_reg_FPR
##  MBS1 
## FALSE 
## 
## $regression$is_discrete_predictor
## $regression$is_discrete_predictor$Eti
## [1] FALSE
## 
## $regression$is_discrete_predictor$FPR
##  MBS1 
## FALSE 
## 
## 
## 
## $BrS_grp
## [1] FALSE
## 
## $SS_grp
## [1] FALSE

Second, we specify a model that uses both BrS and SS data to fit an NPLCM without regression. To do this, we just need to modify the measurements argument and specify a TPR prior for the SS data. This flexibility shows the package can work with different sources of data.

model_options_no_reg_with_SS <- model_options_no_reg
model_options_no_reg_with_SS$use_measurements = c("BrS","SS")
model_options_no_reg_with_SS$prior$TPR_prior$SS  = 
  list(info  = "informative", input = "match_range",
        val   = list(MSS1 = list(up = list(rep(0.5,length(SS_object_1$patho))),
                               low = list(rep(0.01,length(SS_object_1$patho))))))

3.2 Model options for NPLCM with regression

3.2.1 Non-nested models (K=1): discrete covariate only

In this example, we set the model fitting options for a non-nested PLCM with one discrete covariate. Note that the dimension of the etiology prior here (a 2 by 6 matrix) has changed from model_options_no_reg in the previous example (a vector of length 6).

# copy the model options
model_options_reg_nonest_strat <-  list(
  use_measurements = c("BrS"), 
  likelihood  = list(
    cause_list = cause_list,
    k_subclass = 1,
    Eti_formula = ~ -1+as.factor(SITE), # etiology regression
    FPR_formula = list(MBS1 =  ~ -1 + as.factor(SITE))), # subclass weight regression.
  prior= list(
    Eti_prior = matrix(1,nrow=N.SITE,ncol=length(cause_list)), # prior needed for each stratum.
    TPR_prior  = list(BrS = list(
      info  = "informative", 
      input = "match_range", 
      val = list(MBS1 = list(up =  list(rep(0.99,J.BrS)),
                             low = list(rep(0.55,J.BrS))))
    ))))

3.2.2 Nested models (K>1): discrete covariate only

To fit an NPLCM with the same data as in Section 3.2.1, we just need to change k_subclass argument and change the etiology prior specification Eti_prior as follows. The difference in the Eti_prior specification reflects different parameterizations in the nested (multinomial regression for etiology) and non-nested (stratum-specific etiology, no multinomial regression) models when dealing with discrete covariates.

model_options_reg_nest_strat <- model_options_reg_nonest_strat
model_options_reg_nest_strat$likelihood$k_subclass <- K
model_options_reg_nest_strat$prior$Eti_prior <- c(2,2)

3.2.3 Non-nested models (K=1): mixed discrete and continuous covariates

In this example, we fit a non-nested PLCM with multiple covariates: SITE which is discrete, and DATE which is continuous. In the case of covariates that are formatted as dates in R, baker provides the functions s_date_Eti and s_date_FPR in order to create the design matrices. Users can choose the basis expansion (e.g. ps for penalized-splines) and the degrees of freedom (dof); options such as ncs is also available for natual cubic splines. In regression formula, other basis functions can also be used.

model_options_reg_nonest <- model_options_reg_nest_strat
model_options_reg_nonest$likelihood$k_subclass <- 1
model_options_reg_nonest$likelihood$Eti_formula = ~ -1+s_date_Eti(DATE,Y,basis='ps',dof=7)+as.factor(SITE)
model_options_reg_nonest$likelihood$FPR_formula = list(MBS1 =  ~ -1 +s_date_FPR(DATE,Y,basis = "ps",dof=5) + as.factor(SITE))

3.2.4 Nested models (K>1): mixed discrete and continuous covariates

To fit the same model as above, but now with subclasses, all we need to do is modify the subclass argument to be the number of desired subclasses.

model_options_reg_nest <- model_options_reg_nonest
model_options_reg_nest$likelihood$k_subclass <- K

4 Model fitting by MCMC

Because we use JAGS for Bayesian inference of the models, we need to provide .bug files with model likelihood and prior. The package baker can interpret the specified model options and write them in .bug files. In this case, baker uses some internal function to generate a.bug model file for conditional independence models with regression (discrete covariates), accommodating multiple bronze-standard and silver-standard data. This model file will be stored in the path specified by result.folder.

# create a directory for testing code (LOCAL, temporary folder):
working_dir <- "~/Downloads/" #tempdir() #
#can change to your local folder if needed; tempdir avoids hard coded paths.
Date        <- gsub("-", "_", Sys.Date())

In addition, in the demonstrations, we have shortened the iterations to produce results within a reasonable amount of time; in practice, we recommend longer chains and conducting convergence diagnostics when specifying the number of iterations.

4.1 MCMC options for NPLCM without regression

First, we fit an NPLCM using only BrS data.

set.seed(1)
# include stratification information in file name:
thedir_dated    <- paste0(working_dir,Date,"_no_reg")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))


# create folders to store the model results 
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

# place the nplcm data and cleaning options into the results folder 
dput(data_nplcm_noreg,file.path(thedir,"data_nplcm.txt"))   
dput(clean_options, file.path(thedir,"data_clean_options.txt"))


# options for MCMC chains:
mcmc_options_no_reg <- list(
  n.chains = 1,
  n.itermcmc = as.integer(2000), 
  n.burnin = as.integer(1000), 
  n.thin = 1,
  individual.pred = TRUE, 
  ppd = TRUE,
  result.folder = thedir,
  bugsmodel.dir = thedir
)

nplcm_noreg <- nplcm(data_nplcm_noreg,
                     model_options_no_reg,mcmc_options_no_reg)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_no_reg/results
# just to reduce the number of iterations for model_options in the following (reduce 
# the time to run this reproducible file); in practice, these numbers need to be increased.
mcmc_options_no_reg$n.itermcmc <- 200
mcmc_options_no_reg$n.burnin <- 100

4.1.1 NPLCM without regression (BrS and SS data)

Second, we fit an NPLCM using both BrS and SS data.

set.seed(1)
# include stratification information in file name:
thedir_dated    <- paste0(working_dir,Date,"_no_reg_with_SS")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))

# create folders to store the model results 
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

# options for MCMC chains:
mcmc_options_no_reg_with_SS <- mcmc_options_no_reg
mcmc_options_no_reg_with_SS$result.folder <- thedir
mcmc_options_no_reg_with_SS$bugsmodel.dir <- thedir

# place the nplcm data and cleaning options into the results folder
dput(data_nplcm_noreg,file.path(thedir,"data_nplcm.txt")) 
dput(clean_options, file.path(thedir,"data_clean_options.txt"))

rjags::load.module("glm")
## module glm loaded
nplcm_noreg_with_SS <- nplcm(data_nplcm_noreg,
                             model_options_no_reg_with_SS,mcmc_options_no_reg_with_SS)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_no_reg_with_SS/results

4.2 MCMC options for NPLCM with regression

4.2.1 Non-nested models (K=1): discrete covariates only

set.seed(2)
# create folders to store the MCMC output and model results 
thedir_dated   <- paste0(working_dir,Date,"_reg_nonest_strat")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

# set mcmc options 
mcmc_options_reg_nonest_strat <-  mcmc_options_no_reg
mcmc_options_reg_nonest_strat$get.pEti = TRUE
mcmc_options_reg_nonest_strat$result.folder = thedir
mcmc_options_reg_nonest_strat$bugsmodel.dir = thedir

## put the simulated data into the results folder 
dput(clean_options, file.path(thedir, "data_clean_options.txt"))
dput(data_nplcm_reg_nest_strat,file.path(thedir,"data_nplcm.txt")) 

nplcm_reg_nonest_strat <- nplcm(data_nplcm_reg_nest_strat,
                                model_options_reg_nonest_strat,mcmc_options_reg_nonest_strat)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_reg_nonest_strat/results

4.2.2 Nested models (K>1): discrete covariates only

set.seed(2)
# create folders to store the MCMC output and model results 
thedir_dated   <- paste0(working_dir,Date,"_reg_nest_strat")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

# set mcmc options 
mcmc_options_reg_nest_strat <-  mcmc_options_reg_nonest_strat
mcmc_options_reg_nest_strat$result.folder = thedir
mcmc_options_reg_nest_strat$bugsmodel.dir = thedir

## put the simulated data into the results folder 
dput(clean_options, file.path(thedir, "data_clean_options.txt"))
dput(data_nplcm_reg_nest_strat,file.path(thedir,"data_nplcm.txt")) 

nplcm_reg_nest_strat <- nplcm(data_nplcm_reg_nest_strat,
                              model_options_reg_nest_strat,mcmc_options_reg_nest_strat)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_reg_nest_strat/results

4.2.3 NPLCM regression with K=1: mixed discrete and continuous covariates

set.seed(3)
thedir_dated  <- paste0(working_dir,Date,"_reg_nonest")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

## set mcmc options
mcmc_options_reg_nonest <- mcmc_options_reg_nest_strat
mcmc_options_reg_nonest$get.pEti = FALSE
mcmc_options_reg_nonest$result.folder = thedir
mcmc_options_reg_nonest$bugsmodel.dir = thedir

# place the nplcm data and cleaning options into the results folder
dput(clean_options, file.path(thedir, "data_clean_options.txt"))
dput(data_nplcm_reg_nest,file.path(thedir,"data_nplcm.txt")) 

nplcm_reg_nonest <- nplcm(data_nplcm_reg_nest,
                          model_options_reg_nonest,mcmc_options_reg_nonest)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_reg_nonest/results

4.2.4 NPLCM with regression nested model (K>1): mixed discrete and continuous covariates

set.seed(3)
thedir_dated  <- paste0(working_dir,Date,"_reg_nest")
thedir <- file.path(thedir_dated,paste("results",collapse="_"))
dir.create(thedir_dated, showWarnings = FALSE)
dir.create(thedir, showWarnings = FALSE)

## set mcmc options
mcmc_options_reg_nest <- mcmc_options_reg_nonest
mcmc_options_reg_nest$result.folder = thedir
mcmc_options_reg_nest$bugsmodel.dir = thedir

# place the nplcm data and cleaning options into the results folder
dput(clean_options, file.path(thedir, "data_clean_options.txt"))
dput(data_nplcm_reg_nest,file.path(thedir,"data_nplcm.txt")) 

nplcm_reg_nest <- nplcm(data_nplcm_reg_nest,
                        model_options_reg_nest,mcmc_options_reg_nest)
## ==[baker] Results stored in: == 
##  ~/Downloads/2022_01_21_reg_nest/results

5 Record analysis results

To perform diagnostics and create some visualizations of the model estimates from the data set, we can save our data set and measurement information in the same folder as the posterior samples. An example folder is provided at the link here; it is not included in the package itself because of the large files sizes.For example, in the model fitting stored in nplcm_noreg$result_folder_no_reg, it contains multiple files:

library(fs)
fs::dir_tree(path = nplcm_noreg$DIR_NPLCM, recurse = TRUE)
## ~/Downloads/2022_01_21_no_reg/results
## ├── CODAchain1.txt
## ├── CODAindex.txt
## ├── data_clean_options.txt
## ├── data_nplcm.txt
## ├── jagsdata.txt
## ├── jagsinits1.txt
## ├── jagsscript.txt
## ├── mcmc_options.txt
## ├── model_NoReg.bug
## └── model_options.txt

It is designed such that intermediate model results, model specifications, input data are retained for debugging and re-purposing for analyses not included in baker, such as post-stratification of CSCFs by discrete covarates (e.g., age group) using model results obtained from an NPLCM fitted without covariates. In addition, in high-performance computing, we may organize simulation settings by folder with proper names indicating the differences in the ground truth. Separate functions can be written and applied to these folders to obtain simulation results for various comparisons. Although the downside is the extra storage of results in the folder, but we believe in a complex substantive application, retaining these information is often more beneficial at the cost of additional disk memory. Fortunately, generic functions in the baker package can read and organize these information if the fitted object is provided.

6 Convergence and model diagnostics

6.1 Convergence diagnostics

After running a model, we can inspect the files in the folder that contains recoded information for the model results. The file jagsdata.txt contains all data and parameters used when fitting the model, which is saved in bugs.dat as follows. The files CODAchain1.txt and CODAindex.txt together record the posterior samples. Here we illustrate by using pEti, which stores the posterior samples of the CSCFs; we illustrate by focusing on model results obtained from an NPLCM without regression nplcm_noreg.

The posterior samples can be extracted by coda::read.coda and saved as an mcmc object.

res_nplcm_noreg <- coda::read.coda(file.path(nplcm_noreg$DIR_NPLCM,"CODAchain1.txt"),                          file.path(nplcm_noreg$DIR_NPLCM,"CODAindex.txt"),quiet=TRUE)

get_res   <- function(x,res) res[,grep(x,colnames(res))]

res <- get_res("pEti",res_nplcm_noreg)

We can then inspect the traceplots of the posterior sampling chains:

ggmcmc::ggs_traceplot(ggmcmc::ggs(res))

#ggmcmc::ggs_caterpillar(ggmcmc::ggs(res))

We can use powerful functions from coda to directly investigate these posterior sampling chains:

#R2WinBUGS:::plot.bugs(nplcm_noreg) # all parameters

# coda::gelman.diag(mcmc.list(res))
coda::raftery.diag(mcmc.list(res))
## [[1]]
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
## 
## You need a sample size of at least 3746 with these values of q, r and s
coda::effectiveSize(mcmc.list(res))
##  pEti[1]  pEti[2]  pEti[3]  pEti[4]  pEti[5]  pEti[6] 
## 151.0994 117.9783 149.3861 136.1919 144.2959 174.1503

6.2 Posterior predictive checking

We provide two useful posterior predictive checking functions:

  1. Standardized log odds ratio differences (SLORD); this is based on pairwise associations among the BrS measurements - near zero and non-significant SLORDs indicate the association is adequately characterized by the model.

  2. Probabilities of multivariate binary patterns with the highest frequencies; this is based on all the dimensions of measurements, not just pairwise, hence providing additional capability to assess the adequacy of the model in capturing the observed frequencies of multiple binary patterns.

6.2.1 Pairwise log-odds ratios

  1. plot_check_pairwise_SLORD(): a figure of posterior predicted pairwise log odds ratio (LOR) compared with the observed LOR for the BrS data. The numbers are (predicted LOR - observed LOR)/ s.e. (posterior predictive distribution of LOR); The closer to zero the better. The posterior predicted pairwise LOR can be derived as follows. At each MCMC iteration, we generate a new data set based on the model and parameter values at that iteration, of which pairwise LOR can be calculated. The sample size of the new data set equals that of the actual data set, i.e. the same number of cases and controls.

We note that the following figure may not be exactly the same as the one shown in the manuscript. This is due to the randomness in the JAGS algorithm, which we are not able to reproduce every time.

plot_check_pairwise_SLORD(nplcm_noreg$DIR_NPLCM,slice=1)

## ==[baker]A figure is generated for model checking: pairwise standardized log odds ratio difference (SLORD). 
##           Stored in  ~/Downloads/2022_01_21_no_reg/results  ==

6.2.2 Top observed patterns

plot_check_common_pattern(list(nplcm_noreg$DIR_NPLCM),slice_vec =1,n_pat = 5)
## ==Plotting for model checking: frequent BrS measurements patterns. ==

7 Summary and plotting

7.1 Generic summary() and print()

baker also provides generic functions to print out common summaries of fitted model information. We will supplement these summary information with powerful visualizations to be introduced afterwards.

## no reg
summary(nplcm_noreg) 
## [baker] summary: model structure 
##            fitted type:  no_reg 
## ---
##      name measurements:  MBS MSS MGS 
## slices of measurements:  1 0 0 
##                 nested:  TRUE 
## ---
##             regression:  
##                   etiology:  FALSE 
##                   name FPR:  MBS1 
##                        FPR:  FALSE 
## ---
## all discrete predictor:  
##                   etiology:  FALSE 
##                   name FPR:  MBS1 
##                        FPR:  FALSE 
## 
## ------- posterior summary -----------
##    post.mean    post.sd      CrI_025   CrI_0975
## A 0.51898667 0.04321884 0.4389885500 0.60610560
## B 0.14547767 0.03317947 0.0878256300 0.21666715
## C 0.19219672 0.03751189 0.1257157750 0.27151822
## D 0.07235960 0.02510018 0.0284053900 0.12818678
## E 0.01885477 0.01378702 0.0007806531 0.05181332
## F 0.05212458 0.02297974 0.0147645750 0.10357957
## reg_nonest_strat
summary(nplcm_reg_nonest_strat)
## $res_list
## $res_list$`Posterior distributions of CSCFs for stratum: 1; weight: 0.5`
##   cause   eti_mean      ci_025     ci_975
## 1     A 0.50185524 0.440342625 0.59569327
## 2     B 0.15892886 0.095557432 0.22734682
## 3     C 0.18220201 0.115159125 0.25592510
## 4     D 0.04231605 0.008237625 0.10825852
## 5     E 0.04228591 0.010408390 0.08772777
## 6     F 0.07241196 0.036107700 0.13201647
## 
## $res_list$`Posterior distributions of CSCFs for stratum: 2; weight: 0.5`
##   cause   eti_mean       ci_025     ci_975
## 1     A 0.23979061 0.1711388500 0.31120415
## 2     B 0.50427492 0.4202552000 0.57224760
## 3     C 0.13946342 0.0746520800 0.21254932
## 4     D 0.06848787 0.0335582850 0.11721230
## 5     E 0.03167196 0.0076331873 0.06627329
## 6     F 0.01631116 0.0004152776 0.04601204
## 
## $res_list$`Posterior distributions of CSCFs (across all levels using weights: (0.5,0.5))`
##   cause   eti_mean     ci_025     ci_975
## 1     A 0.37082293 0.32792886 0.41845556
## 2     B 0.33160189 0.29058285 0.39370026
## 3     C 0.16083272 0.11903049 0.20624271
## 4     D 0.05540196 0.02654229 0.09663563
## 5     E 0.03697893 0.01045470 0.06822240
## 6     F 0.04436156 0.02063413 0.07944312
## 
## 
## $parsed_model
## $parsed_model$num_slice
## MBS MSS MGS 
##   1   0   0 
## 
## $parsed_model$nested
## [1] FALSE
## 
## $parsed_model$regression
## $parsed_model$regression$do_reg_Eti
## [1] TRUE
## 
## $parsed_model$regression$do_reg_FPR
## MBS1 
## TRUE 
## 
## $parsed_model$regression$is_discrete_predictor
## $parsed_model$regression$is_discrete_predictor$Eti
## [1] TRUE
## 
## $parsed_model$regression$is_discrete_predictor$FPR
## MBS1 
## TRUE 
## 
## 
## 
## $parsed_model$BrS_grp
## [1] FALSE
## 
## $parsed_model$SS_grp
## [1] FALSE
## 
## 
## $unique_Eti_level
##   as.factor(SITE)1 as.factor(SITE)2
## 1                1                0
## 2                0                1
## 
## $fitted_type
## [1] "reg_nonest_strat"
## 
## attr(,"class")
## [1] "summary.nplcm.reg_nonest_strat"
## reg_nonest
a <- 1
DISCRETE_BOOL      <- data_nplcm_reg_nest$X$SITE==a

summary(nplcm_reg_nonest,stratum_bool=DISCRETE_BOOL)
## $Eti_overall_mean
## [1] 0.391914071 0.124167484 0.003087651 0.002379502 0.002144173 0.476307118
## 
## $Eti_overall_q
##            [,1]       [,2]         [,3]         [,4]         [,5]      [,6]
## 2.5%  0.2985352 0.07156935 0.0001593436 2.804072e-05 1.200121e-05 0.3536714
## 97.5% 0.5004845 0.19831142 0.0116159207 1.130913e-02 1.001583e-02 0.5635828
## 
## $Eti_overall_sd
## [1] 0.054386187 0.033364543 0.003794579 0.003168325 0.002941558 0.055909009
## 
## $res
##     post.mean     post.sd      CrI_025   CrI_0975
## A 0.391914071 0.054386187 2.985352e-01 0.50048452
## B 0.124167484 0.033364543 7.156935e-02 0.19831142
## C 0.003087651 0.003794579 1.593436e-04 0.01161592
## D 0.002379502 0.003168325 2.804072e-05 0.01130913
## E 0.002144173 0.002941558 1.200121e-05 0.01001583
## F 0.476307118 0.055909009 3.536714e-01 0.56358276
## 
## $parsed_model
## $parsed_model$num_slice
## MBS MSS MGS 
##   1   0   0 
## 
## $parsed_model$nested
## [1] FALSE
## 
## $parsed_model$regression
## $parsed_model$regression$do_reg_Eti
## [1] TRUE
## 
## $parsed_model$regression$do_reg_FPR
## MBS1 
## TRUE 
## 
## $parsed_model$regression$is_discrete_predictor
## $parsed_model$regression$is_discrete_predictor$Eti
## [1] FALSE
## 
## $parsed_model$regression$is_discrete_predictor$FPR
##  MBS1 
## FALSE 
## 
## 
## 
## $parsed_model$BrS_grp
## [1] FALSE
## 
## $parsed_model$SS_grp
## [1] FALSE
## 
## 
## $fitted_type
## [1] "reg_nonest"
## 
## attr(,"class")
## [1] "summary.nplcm.reg_nonest"
## reg_nest_strat
summary(nplcm_reg_nest_strat)
## $res_list
## $res_list$`Posterior distributions of CSCFs for stratum: 1; weight: 0.5`
##   cause   eti_mean      ci_025     ci_975
## 1     A 0.58066822 0.512700462 0.68608840
## 2     B 0.13915084 0.081865703 0.20591399
## 3     C 0.15401646 0.090267745 0.23170060
## 4     D 0.02659403 0.002771325 0.07367273
## 5     E 0.03809814 0.004631847 0.06990748
## 6     F 0.06147232 0.027172463 0.13066090
## 
## $res_list$`Posterior distributions of CSCFs for stratum: 2; weight: 0.5`
##   cause   eti_mean      ci_025     ci_975
## 1     A 0.28753164 0.210604394 0.35781937
## 2     B 0.47428487 0.402480217 0.55115488
## 3     C 0.14573558 0.098145528 0.19459967
## 4     D 0.05550620 0.022478555 0.09500001
## 5     E 0.01838353 0.001218899 0.06304055
## 6     F 0.01855818 0.001707584 0.06547970
## 
## $res_list$`Posterior distributions of CSCFs (across all levels using weights: (0.5,0.5))`
##   cause   eti_mean     ci_025     ci_975
## 1     A 0.43409993 0.37191540 0.50704648
## 2     B 0.30671786 0.25649708 0.35615032
## 3     C 0.14987602 0.10439840 0.20260803
## 4     D 0.04105012 0.01353969 0.06705656
## 5     E 0.02824083 0.01107007 0.04841912
## 6     F 0.04001525 0.01776426 0.07365346
## 
## 
## $parsed_model
## $parsed_model$num_slice
## MBS MSS MGS 
##   1   0   0 
## 
## $parsed_model$nested
## [1] TRUE
## 
## $parsed_model$regression
## $parsed_model$regression$do_reg_Eti
## [1] TRUE
## 
## $parsed_model$regression$do_reg_FPR
## MBS1 
## TRUE 
## 
## $parsed_model$regression$is_discrete_predictor
## $parsed_model$regression$is_discrete_predictor$Eti
## [1] TRUE
## 
## $parsed_model$regression$is_discrete_predictor$FPR
## MBS1 
## TRUE 
## 
## 
## 
## $parsed_model$BrS_grp
## [1] FALSE
## 
## $parsed_model$SS_grp
## [1] FALSE
## 
## 
## $unique_Eti_level
##   as.factor(SITE)1 as.factor(SITE)2
## 1                1                0
## 2                0                1
## 
## $fitted_type
## [1] "reg_nest_strat"
## 
## attr(,"class")
## [1] "summary.nplcm.reg_nest_strat"
## reg_nest
summary(nplcm_reg_nest,stratum_bool=DISCRETE_BOOL)
## $Eti_overall_mean
## [1] 0.48419620 0.15133910 0.19383522 0.09163163 0.03636304 0.04263481
## 
## $Eti_overall_q
##            [,1]       [,2]      [,3]       [,4]        [,5]       [,6]
## 2.5%  0.4064478 0.09881242 0.1300929 0.04578816 0.003213423 0.01250589
## 97.5% 0.5439524 0.21940627 0.2531137 0.14766931 0.078201451 0.07593862
## 
## $Eti_overall_sd
## [1] 0.03981848 0.03189680 0.03242157 0.02548509 0.01882144 0.01712138
## 
## $res
##    post.mean    post.sd     CrI_025   CrI_0975
## A 0.48419620 0.03981848 0.406447824 0.54395241
## B 0.15133910 0.03189680 0.098812424 0.21940627
## C 0.19383522 0.03242157 0.130092931 0.25311366
## D 0.09163163 0.02548509 0.045788159 0.14766931
## E 0.03636304 0.01882144 0.003213423 0.07820145
## F 0.04263481 0.01712138 0.012505893 0.07593862
## 
## $parsed_model
## $parsed_model$num_slice
## MBS MSS MGS 
##   1   0   0 
## 
## $parsed_model$nested
## [1] TRUE
## 
## $parsed_model$regression
## $parsed_model$regression$do_reg_Eti
## [1] TRUE
## 
## $parsed_model$regression$do_reg_FPR
## MBS1 
## TRUE 
## 
## $parsed_model$regression$is_discrete_predictor
## $parsed_model$regression$is_discrete_predictor$Eti
## [1] FALSE
## 
## $parsed_model$regression$is_discrete_predictor$FPR
##  MBS1 
## FALSE 
## 
## 
## 
## $parsed_model$BrS_grp
## [1] FALSE
## 
## $parsed_model$SS_grp
## [1] FALSE
## 
## 
## $fitted_type
## [1] "reg_nest"
## 
## attr(,"class")
## [1] "summary.nplcm.reg_nest"

7.2 Visualizations via generic plot()

7.2.1 NPLCM without regression (BrS and SS measurements)

# plot data, prior, and posterior for all causes
plot(nplcm_noreg_with_SS, bg_color = NULL)
## 
##  == Plotting Panels of Measurements and Marginal Posterior of Etiology Fractions ==
## 
##  == Plotting BrS Slice:  1 :  MBS1
## 
##  == Plotting SS Slice:  1 :  MSS1
## 
##  == Plotting pies ==

## 
##  == Done. ==

7.2.2 NPLCM regression

7.2.2.1 Nested models (K>1): discrete covariates only

In the case of an NPLCM with one or more discrete covariates, we can use plot(), which plots the posterior distributions of CSCFs for each covariate stratum. We can overlay the true parameter values if supplied; this is only possible in simulation studies. For real data analysis, we simply plot the estimates and associated uncertainty bands. The following code produces a figure for each stratum (the final row is for the overall estimates as a weighted average across strata). Alternatively, setting show_levels=0 only plots the posterior distributions for each CSCF averaged over all strata using empirical weights of the discrete covariate; alternative weights can also be supplied to produce the weighted averages.

plot(nplcm_reg_nest_strat,show_levels=c(0,1,2))
## ==[baker] plotting stratified etiologies with >> nested << model==
## [1] "==[baker] actual meanings of levels (by row):"
##   as.factor(SITE)1 as.factor(SITE)2
## 1                1                0
## 2                0                1

7.2.2.2 Nested models (K>1): continuous and discrete covariates

For an NPLCM with a continuous covariate, we can also plot the estimated CSCF for each cause. The dotted lines plot the posterior mean CSCFs and the shaded blue/red area around the curves corresponds to \(2.5\%\) and \(97.5\%\) posterior quantiles. The following code produces such an example based on the fitted object nplcm_reg_nest. Truth can be plotted as well if supplied to the plotting function; this would only possible in simulation studies.

a <- 1
DISCRETE_BOOL <- data_nplcm_reg_nest$X$SITE==a
plot(nplcm_reg_nest, stratum_bool = DISCRETE_BOOL)
## ==[baker] plotting etiology regression with 
##         >> nested << model for 
##         BrS Measure slice =  1 :  MBS1  .==

8 Real data example: PERCH study

Since the PERCH study data is not currently public and freely accessible, we provide example code of running nplcm() with model specifications (model_options) and model fitting options (mcmc_options). In order to more fully illustrate the functionality of baker, we use a pre-run posterior analysis of PERCH data and demonstrate the summary and visualization of a regression model using baker.

8.1 Data cleaning and preparation

# specify cause list:
cause_list <- c("HINF","PNEU","ADENO","HMPV_A_B","PARA_1","RHINO","RSV","other")
# specify measurements:
patho_BrS_NPPCR      <- c("HINF","PNEU","ADENO","HMPV_A_B","PARA_1","RHINO","RSV")

patho_SS_BCX         <- c("HINF","PNEU")

# bronze-standard measurements:
BrS_object_1 <- make_meas_object(patho_BrS_NPPCR,"NP","PCR","BrS",cause_list)

# silver-standard measurements:
SS_object_1  <- make_meas_object(patho_SS_BCX,"B","CX","SS",cause_list)

# parent directory for testing code (LOCAL):
working_dir <- tempdir()

# date stamp for analysis:
Date        <- gsub("-", "_", Sys.Date())
SITENAME    <- "05SAF"
result_folder  <- paste0(working_dir,Date,"data_analysis",SITENAME)
dir.create(result_folder)

8.2 Model specification and fitting

We can specify a model that have etiology depends on age, severity status, and HIV status, and let FPRs to depend on age and HIV status. We use both BrS and SS data for estimation. This can be achieved by the following code.

# specify model: 
model_options <- list(likelihood   = list(cause_list = cause_list,        
                                          k_subclass = c(5),                    
                                          Eti_formula = ~ -1+as.factor(AGE)+as.factor(ALL_VS)+as.factor(HIV2),             
                                          FPR_formula = list(
                                            NPPCR =  ~ -1+as.factor(AGE)+as.factor(HIV2) )),  
                      use_measurements = c("BrS","SS"),  
                      prior  = list(
                        Eti_prior  = c(2,2),  
                        TPR_prior   = list(
                          BrS  = list(info  = "informative",
                                      input = "match_range",
                                      val   = list(
                                        NPPCR = list(up = list(rep(0.9,length(BrS_object_1$patho))),
                                                     low = list(rep(0.5,length(BrS_object_1$patho)))
                                        ))
                          ),
                          SS   = list(info  = "informative",
                                      input = "match_range",
                                      val   = list(
                                        MSS1 = list(up = list(rep(0.2,length(SS_object_1$patho))),
                                                    low = list(rep(0.05,length(SS_object_1$patho))) 
                                        ))))   
                      )
)

We then specify the MCMC options:

# specify MCMC options:
mcmc_options <- list(n.chains   = 1,
                     n.itermcmc = 200,
                     n.burnin   = 100,
                     n.thin     = 1,
                     individual.pred = TRUE,
                     ppd             = TRUE,
                     result.folder = result_folder,
                     bugsmodel.dir = result_folder)

In the following, we fit the model using the above model and MCMC specifications. We commented the code out because the actual PERCH data is not public. We have produced results and figures, which are shown in the next section. Here we assume the data_nplcm has been loaded from the real data (not publicly available yet).

# store results:
dput(data_nplcm,file.path(mcmc_options$result.folder,"data_nplcm.txt"))
dput(clean_options,file.path(mcmc_options$result.folder,"data_clean_options.txt"))

assign_model(model_options,data_nplcm)

rjags::load.module("glm")
perch_fit <- nplcm(data_nplcm,model_options,mcmc_options)

8.3 Results

We can summarize the fitted object using summary()

# summary:
summary(perch_fit)
# [baker] summary: model structure 
#            fitted type:  reg_nest_strat 
# ---
#      name measurements:  MBS MSS MGS 
# slices of measurements:  1 1 0 
#                 nested:  TRUE 
# ---
#             regression:  
#                   etiology:  TRUE 
#                   name FPR:  NPPCR 
#                        FPR:  TRUE 
# ---
# all discrete predictor:  
#                   etiology:  TRUE 
#                   name FPR:  NPPCR 
#                        FPR:  TRUE 
# 
# ------- strata definitions (by row) -----------
#   as.factor(AGE)0 as.factor(AGE)1 as.factor(ALL_VS)1 as.factor(HIV2)1
# 1               1               0                  0                0
# 2               0               1                  0                0
# 3               1               0                  1                0
# 4               0               1                  0                1
# 5               1               0                  0                1
# 6               0               1                  1                0
# 7               1               0                  1                1
# 8               0               1                  1                1
# 
# ------- posterior summary -----------
# $`Posterior distributions of CSCFs for stratum: 1; weight: 0.4008`
#      cause   eti_mean      ci_025     ci_975
# 1     HINF 0.05166265 0.002158625 0.14952215
# 2     PNEU 0.01367464 0.001942877 0.03410881
# 3    ADENO 0.02755682 0.007606155 0.07437066
# 4 HMPV_A_B 0.04024640 0.011691264 0.08843427
# 5   PARA_1 0.02005626 0.005066523 0.04550176
# 6    RHINO 0.01596415 0.001254172 0.05131923
# 7      RSV 0.63260225 0.528099705 0.73257237
# 8    other 0.19823683 0.073161901 0.30355782
# 
# $`Posterior distributions of CSCFs for stratum: 2; weight: 0.1374`
#      cause   eti_mean       ci_025     ci_975
# 1     HINF 0.19318765 0.0312842667 0.41066459
# 2     PNEU 0.01918998 0.0004526772 0.05190771
# 3    ADENO 0.10152130 0.0282180990 0.22350238
# 4 HMPV_A_B 0.07145761 0.0065136730 0.16980122
# 5   PARA_1 0.06044748 0.0218096814 0.11084968
# 6    RHINO 0.19438414 0.0784071077 0.39107923
# 7      RSV 0.24012187 0.1166834127 0.39979477
# 8    other 0.11968998 0.0278590565 0.24438407
# 
# $`Posterior distributions of CSCFs for stratum: 3; weight: 0.229`
#      cause   eti_mean       ci_025     ci_975
# 1     HINF 0.05376153 0.0004231633 0.17367243
# 2     PNEU 0.02476798 0.0021959521 0.08402144
# 3    ADENO 0.06240029 0.0160461755 0.12986435
# 4 HMPV_A_B 0.07908623 0.0216718745 0.14806944
# 5   PARA_1 0.01171843 0.0027691013 0.03116896
# 6    RHINO 0.05509150 0.0006932584 0.17327738
# 7      RSV 0.70220153 0.5239905823 0.87145158
# 8    other 0.01097253 0.0005566912 0.08269008
# 
# $`Posterior distributions of CSCFs for stratum: 4; weight: 0.0458`
#      cause   eti_mean       ci_025     ci_975
# 1     HINF 0.08555718 6.467730e-04 0.55807698
# 2     PNEU 0.04883619 9.547829e-05 0.22913261
# 3    ADENO 0.18270059 2.726466e-02 0.49957376
# 4 HMPV_A_B 0.01237290 3.819012e-04 0.04875143
# 5   PARA_1 0.06447629 7.134228e-03 0.19808679
# 6    RHINO 0.04318612 2.718836e-03 0.15401763
# 7      RSV 0.03270711 4.562130e-03 0.10813316
# 8    other 0.53016363 9.850885e-02 0.81265527
# 
# $`Posterior distributions of CSCFs for stratum: 5; weight: 0.0706`
#      cause    eti_mean       ci_025     ci_975
# 1     HINF 0.019893074 7.335684e-05 0.13859220
# 2     PNEU 0.025071515 1.694575e-04 0.09301997
# 3    ADENO 0.048805764 3.085044e-03 0.13433229
# 4 HMPV_A_B 0.006976696 2.860040e-04 0.04299826
# 5   PARA_1 0.020952643 1.696853e-03 0.07661957
# 6    RHINO 0.003127074 1.531582e-04 0.02162442
# 7      RSV 0.077598604 1.487724e-02 0.19364007
# 8    other 0.797574630 6.341936e-01 0.95400327
# 
# $`Posterior distributions of CSCFs for stratum: 6; weight: 0.063`
#      cause   eti_mean       ci_025    ci_975
# 1     HINF 0.13319305 0.0024619564 0.4523032
# 2     PNEU 0.02423749 0.0008789122 0.1075467
# 3    ADENO 0.17579299 0.0688983557 0.3408715
# 4 HMPV_A_B 0.10883529 0.0154968277 0.3141081
# 5   PARA_1 0.03086868 0.0034628218 0.1007166
# 6    RHINO 0.31999102 0.0651708679 0.5771946
# 7      RSV 0.20363695 0.0907099811 0.3587755
# 8    other 0.00344452 0.0003662240 0.0121615
# 
# $`Posterior distributions of CSCFs for stratum: 7; weight: 0.0477`
#      cause   eti_mean       ci_025     ci_975
# 1     HINF 0.01914993 0.0001100869 0.08326242
# 2     PNEU 0.11432073 0.0020279862 0.28610844
# 3    ADENO 0.33937777 0.1264599587 0.58439434
# 4 HMPV_A_B 0.03976130 0.0019585291 0.14169446
# 5   PARA_1 0.03760326 0.0046599167 0.10883773
# 6    RHINO 0.04437827 0.0003947048 0.24210030
# 7      RSV 0.28519761 0.0821113625 0.58036951
# 8    other 0.12021112 0.0068709631 0.37489076
# 
# $`Posterior distributions of CSCFs for stratum: 8; weight: 0.0057`
#      cause   eti_mean       ci_025     ci_975
# 1     HINF 0.02913838 0.0011279548 0.13242488
# 2     PNEU 0.07435222 0.0005752645 0.28585386
# 3    ADENO 0.57735286 0.2651868175 0.88772588
# 4 HMPV_A_B 0.03514250 0.0010853396 0.15890357
# 5   PARA_1 0.05700753 0.0069257999 0.19142499
# 6    RHINO 0.14304200 0.0080589018 0.58842492
# 7      RSV 0.05644195 0.0072323347 0.16293520
# 8    other 0.02752255 0.0017522089 0.09985373
# 
# $`Posterior distributions of CSCFs (across all levels using weights: (0.401,0.137,0.229,0.046,0.071,0.063,0.048,0.006))`
#      cause   eti_mean      ci_025     ci_975
# 1     HINF 0.07435305 0.019851662 0.15269232
# 2     PNEU 0.02520255 0.005961936 0.05351450
# 3    ADENO 0.08166565 0.048916957 0.12769800
# 4 HMPV_A_B 0.05407090 0.030036233 0.08833299
# 5   PARA_1 0.02752423 0.012244731 0.04519004
# 6    RHINO 0.07101062 0.029416593 0.13884296
# 7      RSV 0.48105884 0.396922382 0.58563672
# 8    other 0.18511416 0.115807725 0.24554547

We can also visualize rich information about the posterior distribution of the population level CSCFs in the cases using generic function plot():

# visualize results:
plot(perch_fit)

The figure for overall CSCFs for all the pathogens are shown in the figure below:

We can also produce individual-level etiology estimates given each individual case’s observed measurements:

# individual prediction:
get_individual_prediction(perch_fit)
 #       HINF PNEU ADENO HMPV_A_B PARA_1 RHINO  RSV other
 #  [1,] 0.03 0.00  0.02     0.03   0.04  0.03 0.49  0.36
 #  [2,] 0.59 0.01  0.06     0.04   0.02  0.05 0.11  0.12
 #  [3,] 0.05 0.01  0.07     0.06   0.02  0.05 0.71  0.03
 #  [4,] 0.04 0.00  0.02     0.02   0.01  0.01 0.59  0.31
 #  [5,] 0.10 0.00  0.02     0.03   0.02  0.07 0.76  0.00
 #  [6,] 0.09 0.00  0.03     0.01   0.02  0.00 0.66  0.19
 #  [7,] 0.00 0.00  0.07     0.09   0.02  0.04 0.78  0.00
 #  [8,] 0.03 0.01  0.01     0.05   0.03  0.04 0.65  0.18
 #  [9,] 0.07 0.03  0.08     0.01   0.01  0.05 0.02  0.73
 # [10,] 0.03 0.00  0.29     0.04   0.00  0.02 0.33  0.29

9 Discussion

Our future aim is to add functionality to accommodate mixed categorical, ordinal, and continuous responses. First, the latent class model is naturally suited for multivariate discrete responses; our framework can be extended to handle multiple response levels by adding additional sets of response probability parameters for each response level. Second, ordinal and continuous responses may also be present and may further improve CSCF estimation; mixture components (in cases and controls) may need to be modified by latent random Gaussian vectors, a subset of which are linked to non-continuous responses via thresholding. Fast computational techniques for sampling Gaussian covariance matrices in multivariate probit models akin to [9] may be implemented. Finally, our current functions for NPLCM regression analyses will be expanded to accommodate higher dimensional covariates for CSCF and subclass weights using sparse Bayesian Additive Regression Trees [10].

10 References

1. PERCH Study Group. Causes of severe pneumonia requiring hospital admission in children without HIV infection from Africa and Asia: The PERCH multi-country case-control study. The Lancet. 2019;392:757–79.
2. Wu Z, Deloria-Knoll M, Hammitt LL, Zeger SL, PERCH Study Team the. Partially latent class models for case–control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2016;65:97–114.
3. Wu Z, Deloria-Knoll M, Zeger SL. Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics (Oxford, England). 2017;18:200–13.
4. Wu Z, Chen I. Probabilistic cause-of-disease assignment using case-control diagnostic tests: A latent variable regression approach. Statistics in Medicine. 2021;40:823–41.
5. Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics. 1996;52:797–810.
6. Plummer M et al. JAGS: A program for analysis of bayesian graphical models using gibbs sampling. In: Proceedings of the 3rd international workshop on distributed statistical computing. 2003.
7. Plummer M. JAGS: Just Another Gibbs Sampler. 2022.
8. Deloria Knoll M, Fu W, Shi Q, Prosperi C, Wu Z, Hammitt LL, et al. Bayesian estimation of pneumonia etiology: Epidemiologic considerations and applications to the Pneumonia Etiology Research for Child Health study. Clinical Infectious Diseases. 2017;64:S213–27.
9. Zhang Z, Nishimura A, Bastide P, Ji X, Payne RP, Goulder P, et al. Large-scale inference of correlation among mixed-type biological traits with phylogenetic multivariate probit models. The Annals of Applied Statistics. 2021;15:230–51.
10. Linero AR. Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association. 2018;113:626–36.

  1. Department of Biostatistics, University of Michigan, Ann Arbor↩︎

  2. Department of Biostatistics, University of Michigan, Ann Arbor↩︎

  3. Department of Biostatistics, The Johns Hopkins University, Baltimore↩︎

  4. Department of Biostatistics, University of Michigan, Ann Arbor; ; submit issues at https://github.com/zhenkewu/baker/issues. Acknowledgements - This work was supported by a Michigan Institute of Data Science (MIDAS) seed grant (to Z.W. and I.C.); the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318 to Z.W. and S.L.Z.); and the National Institutes of Health grants (P30CA046592 to Z.W.). We thank the PERCH study team led by Katherine O’Brien for providing the data and scientific advice, Maria Deloria-Knoll and Christine Prosperi for valuable feedback and Jing Chu for preliminary simulations. We also thank John Kubale for valuable feedback.↩︎