Supplemental Materials

Empirical Bayes MCMC Estimation for Modeling Treatment Processes, Mechanisms of Change, and Clinical Outcomes in Small Samples

by Timothy J. Ozechowski, 2014, Journal of Consulting and Clinical Psychology

http://dx.doi.org/10.1037/a0035889

S1. MCMC Posterior Simulation

Markov chain Monte Carlo (MCMC) simulation generates a chain of sequentially dependent draws from known densities that resemble the posterior but from which it is simpler to sample. In one popular MCMC approach, known as Gibbs sampling, a complex joint posterior distribution is decomposed into its constituent conditional distributions. Values for each parameter are sampled from the corresponding conditional posterior given the current values of the other parameters. An alternative MCMC sampling algorithm, known as Metropolis-Hastings, draws candidate values from a known density referred to as the proposal distribution, p(θ), which has mass over the same range as the posterior and from which it is relatively straightforward to sample. A given draw from p(θ) is admitted to the target distribution t(θ) (i.e., the posterior) based on an acceptance/rejection rule. Specifically, at a given point s in the MCMC sequence, a candidate draw θ* from the proposal distribution is compared with the most recently admitted draw θs-1. If the posterior density at θ* is at least as high as the posterior density at θs-1, then θ* is admitted to t(θ) and becomes θs; otherwise, θ* is admitted with probability equal to the ratio of the posterior density at θ* to the posterior density at θs-1 (in the general Metropolis-Hastings case, this ratio is further weighted by the ratio of the proposal densities at the two points). If θ* is rejected, θs-1 is replicated in t(θ) and becomes θs. Both the Gibbs and Metropolis-Hastings algorithms should be allowed to iterate until stationarity is achieved in the Markov chain, which is a necessary condition for the simulated posterior to converge to the true posterior.
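To make the acceptance/rejection rule concrete, the following is a minimal sketch of the algorithm in a SAS DATA step for the special case of a symmetric (random-walk) proposal, known as the Metropolis algorithm, in which the proposal densities cancel from the acceptance ratio. The standard normal target density, the proposal standard deviation, and the chain length are illustrative assumptions only and do not correspond to the demonstration analysis.

data mh_chain;
   call streaminit(2014);                        /* fix the random seed */
   theta = 0;                                    /* arbitrary starting value */
   do s = 1 to 10000;
      /* draw a candidate from a symmetric random-walk proposal */
      theta_star = theta + rand("Normal", 0, 0.5);
      /* log acceptance ratio: log t(theta*) - log t(theta[s-1]),
         where t() is an unnormalized standard normal density */
      log_ratio = (-0.5*theta_star**2) - (-0.5*theta**2);
      if log(rand("Uniform")) < log_ratio then theta = theta_star;  /* accept */
      /* otherwise theta[s-1] is replicated as theta[s] */
      output;
   end;
run;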

S2. Demographic and Clinical Characteristics of the Sample

Participants in the original clinical trial were 120 adolescents and their families residing in greater Albuquerque, New Mexico. The adolescent sample was 80% male and 20% female. The majority of adolescents in the sample identified themselves as being of either Hispanic (46.7%) or Anglo (38.3%) ethnic origin. The remaining 15% identified as either Native American or having a mixed ethnic background. The mean age of the adolescent sample was 15.6 years (SD = 1.0). Participants were referred to the research clinic to receive outpatient adolescent substance abuse treatment. Adolescents meeting official diagnostic criteria for substance abuse or dependence were eligible to participate in the study. The study focused on illicit substance abuse. As such, adolescents whose primary substance of abuse was alcohol or tobacco were excluded from participation.

Marijuana was the primary substance of abuse among the adolescents in the parent clinical trial. On average, adolescents reported using marijuana on 56.8% (SD = 31.8) of the 90 days preceding treatment intake. Use of any substance other than tobacco was reported on an average of 60% (SD = 31.0) of the same 90-day period. In addition to substance abuse, adolescent participants exhibited substantial comorbid emotional and behavioral problems. Specifically, 70.8% of adolescents were rated at or above the borderline clinical threshold for delinquent behavior problems based on parent reports on the Child Behavior Checklist (CBCL; Achenbach, 1991). Moreover, 30% of the adolescent sample scored at or above the threshold for clinical depression on the Beck Depression Inventory (BDI; Beck & Steer, 1987) based on adolescent-specific guidelines for the BDI specified by Roberts, Lewinsohn, and Seeley (1991).

S3. Coder Training

All training activities were conducted by the author of this article. The initial components of the training process focused on didactic study of the principles and practices of Functional Family Therapy (FFT) as well as of the Functional Family Therapy Coding and Rating Scale (FFT CARS) coding manual. The coders initially practiced using the FFT CARS to code written therapy transcripts and then were trained to code digitized video files of FFT sessions. Videotapes of FFT sessions were converted to digital MPEG files using a digital video data acquisition and management system. Coders used a specially designed Windows-based graphical user interface to access the MPEG files, which were stored within a relational database on a local server. The coders viewed the digital video files on a computer workstation monitor and entered codes in real time for each discernible therapist intervention using a numerical keypad. Time-in and time-out codes were automatically associated with each coded entry. At the end of each observed FFT session, the collection of video segments and associated codes was stored as an ASCII text file that could be readily imported into any commonly used statistical software package for analytic purposes. Each week during the training period, the coders independently observed and coded a video-recorded FFT session from an archive of recordings designated for training purposes, and the trainer computed rates of agreement between the two coders on each of the FFT CARS intervention and context codes. Training continued until the coders exhibited at least 70% agreement on all codes for four consecutive weeks; the training process lasted approximately four months.

S4. Positive Definite Posterior Covariance Matrices and the Cholesky Factorization

In general, a matrix of numerical elements is said to be positive definite if it is square, symmetric, and all of its eigenvalues are greater than zero (Leon, 2009). Briefly, an eigenvalue is the variance associated with a principal component, which is a weighted combination of observed variables capturing a unique portion of the overall variance, similar to a factor in factor analysis (Wothke, 1993). A non-positive eigenvalue is a “red flag” signaling that the corresponding weighted combination of variables has zero (or, impossibly, negative) variance, which often is attributable to high degrees of collinearity or redundancy among variables. Non-positive definite covariance matrices are problematic mathematically because matrix inversion entails division by the determinant, which is the product of the eigenvalues; a zero eigenvalue therefore renders the matrix singular and impossible to invert. In the ML estimation setting, a non-positive definite input covariance matrix thus precludes performing the matrix algebra required to optimize the likelihood function and obtain model parameter estimates. Likewise, if the estimated or model-implied covariance matrix in an SEM analysis is non-positive definite, it cannot be shown that the observed data are plausible given the model (i.e., the statistical model cannot be shown to fit the data). From a Bayesian perspective, non-positive definite posterior covariance matrices raise suspicions regarding the validity of the model as a plausible representation of the processes by which the data were produced.

If a given SEM covariance matrix M is positive definite, then the Cholesky factorization may be obtained such that M = L∙LT, where L is a lower triangular matrix known as the Cholesky factor and LT is the transpose of L (i.e., a re-expression of L with the rows and columns interchanged). The factorization exists if and only if M is positive definite: if the Cholesky factorization of M can be computed, then M is positive definite, and if it cannot be computed, then M is not positive definite.
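As an illustration, the following SAS/IML sketch verifies positive definiteness by checking that all eigenvalues are positive and, if so, computes the Cholesky factor via the ROOT function (which returns the upper triangular factor U such that M = UT∙U). The 3 × 3 matrix shown is a hypothetical example rather than a matrix from the demonstration analysis.

proc iml;
   /* hypothetical 3 x 3 covariance matrix */
   M = {4.0 2.0 1.0,
        2.0 3.0 0.5,
        1.0 0.5 2.0};
   ev = eigval(M);              /* eigenvalues of the symmetric matrix M */
   if min(ev) > 0 then do;
      U = root(M);              /* upper triangular factor: M = U` * U */
      L = U`;                   /* lower triangular L, so that M = L * L` */
      print "M is positive definite", L;
   end;
   else print "M is not positive definite; eigenvalues:", ev;
quit;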

In cases where the Cholesky factorization of M cannot be obtained, the source of the non-positive definiteness may be isolated by decomposing M into smaller submatrices (e.g., all 2 × 2 principal submatrices of M) and implementing the Cholesky factorization on each submatrix. Submatrices for which the Cholesky factorization fails would be indicative of parameters in the SEM that may be misspecified. Careful respecification of such parameters would most likely rectify the non-positive definiteness in M.
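The submatrix strategy may be sketched in SAS/IML as follows. The example matrix is hypothetical and deliberately non-positive definite (the third variable is an exact copy of the first), and a symmetric 2 × 2 submatrix is positive definite exactly when its diagonal elements and determinant are all positive, which provides a simple check equivalent to attempting the Cholesky factorization.

proc iml;
   /* hypothetical covariance matrix in which variable 3 duplicates variable 1 */
   M = {4.0 2.0 4.0,
        2.0 3.0 2.0,
        4.0 2.0 4.0};
   p = nrow(M);
   do i = 1 to p - 1;
      do j = i + 1 to p;
         idx = i || j;               /* rows/columns of the 2 x 2 submatrix */
         sub = M[idx, idx];
         /* positive definite iff both variances and the determinant are positive */
         if sub[1,1] <= 0 | sub[2,2] <= 0 | det(sub) <= 0 then
            print "Non-positive definite submatrix for variables" i j;
      end;
   end;
quit;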

S5. Obtaining R Using the SAS NLMIXED Procedure

Although PROC MCMC does not compute the R index, R may be computed using the NLMIXED procedure within the SAS/STAT software package. First, the posterior samples from each MCMC chain must be output to a unitary SAS data set in which all chains are “stacked” vertically. Next, a random-intercept-only model may be fit to this stacked data set using the NLMIXED procedure, with “chain” serving as the clustering, or Level 2, unit and the posterior draws within each chain for a given parameter constituting the Level 1 observations. The NLMIXED procedure computes W and B directly as parameters of the random-intercept-only model. The value of R then may be computed by entering the computational formula for R into the ESTIMATE statement in NLMIXED. For a large number of parameters, the execution of this procedure may be automated by embedding the NLMIXED code within a SAS macro program (see Figure S3).
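A sketch of the NLMIXED specification for a single parameter appears below. The data set name (stacked), the chain identifier (chain), the variable holding the posterior draws (theta), and the starting values in the PARMS statement are hypothetical placeholders. The ESTIMATE statement shows one common formulation of R, the square root of the ratio of total to within-chain variance; the exact expression entered there should follow the computational formula for R referenced in the article.

proc nlmixed data=stacked;
   parms mu=0 W=1 B=0.1;                   /* grand mean; within- and between-chain variances */
   bounds W > 0, B >= 0;
   mu_chain = mu + u;                      /* chain-specific mean (random intercept) */
   model theta ~ normal(mu_chain, W);      /* W: Level 1 (within-chain) variance */
   random u ~ normal(0, B) subject=chain;  /* B: Level 2 (between-chain) variance */
   estimate 'R' sqrt((W + B) / W);         /* one common formulation of the R index */
run;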

S6. Assessment of Model Fit Based on the Posterior Predictive Distribution

Typically, comparisons between the observed data and predicted values sampled from the PPD are based on test statistics, or scalar summaries computed from samples of observations (see Gelman et al., 2004, p. 162; Gelman & Meng, 1996, p. 197). For continuous normally distributed observed variables, the most efficient summary test statistics are the sample mean and standard deviation. The median, mode, minimum, and maximum values may be utilized as test statistics as well. In a Bayesian assessment of model fit, a given test statistic based on the observed sample data, T(yobs), may be compared to a set of corresponding test statistics computed from R simulated samples or replications drawn from the PPD, T(yr), where r = 1, …, R. The most straightforward way of comparing T(yobs) and T(yr) is to plot a histogram of the T(yr) values and pinpoint the location of T(yobs) on this histogram. If the model under investigation exhibits a good fit to the data from a Bayesian point of view, then the value of T(yobs) would be expected to fall near the center of the histogram of T(yr) values, suggesting that the observed data are highly plausible given the Bayesian posterior parameter estimates. Alternatively, one may compute the proportion of T(yr) values that are equal to or greater than T(yobs), that is, Pr[T(yr) ≥ T(yobs)]. This proportion, known as the posterior predictive p value (ppp), expresses the probability that the predicted values derived from the Bayesian estimates of the model parameters are at least as extreme as the observed data. For a good-fitting model, ppp would be expected to equal approximately 0.5, again indicating that T(yobs) falls in the center of the distribution of predicted test statistics T(yr) and that the observed sample data are highly plausible given the Bayesian estimates of the model parameters (see Muthén & Asparouhov, 2012).
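As a concrete illustration, once the R replicated test statistics have been assembled into a data set, the ppp may be computed in a short DATA step. The data set ppd_stats, its variable t_rep (holding T(yr)), and the observed value assigned to t_obs are hypothetical placeholders.

data ppp;
   set ppd_stats end=eof;
   t_obs = 2.4;                        /* T(yobs), an assumed value for illustration */
   if t_rep >= t_obs then n_ge + 1;    /* count replications with T(yr) >= T(yobs) */
   if eof then do;
      ppp = n_ge / _n_;                /* proportion = posterior predictive p value */
      put "Posterior predictive p value = " ppp;
      output;
   end;
run;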

A Robust Assessment of Model Fit Given a Small Sample

When sample sizes are prohibitively small, any given sample statistic T(yobs) may be biased, which in turn may lead to distorted or incorrect assessments of model fit based on comparisons between T(yr) and T(yobs). Therefore, rather than comparing T(yr) with a single point estimate T(yobs) calculated from the sample data, in the current demonstration analysis, values of T(yr) were gauged against the sampling distribution of T(yobs), which was obtained using bootstrap resampling of the sample data (Efron & Tibshirani, 1993). Briefly, bootstrap resampling entails randomly selecting observations with replacement from a given sample of size n until B bootstrap samples of size n have been drawn using only the observations in the original sample y1, …, yn. A given test statistic T(yb) may be computed for each of the B bootstrap samples (b = 1, …, B). The collection of bootstrap test statistics simulates the sampling distribution of T(yobs), with the expected value estimated as μB = (1/B) ∙ Σ T(yb) and the variance estimated as σB² = [1/(B – 1)] ∙ Σ (T(yb) – μB)², where each sum is taken over b = 1, …, B. The square root of the estimated variance of the bootstrap sampling distribution, σB, is a robust estimate of the standard error of T(yobs), which quantifies the sampling variability associated with T(yobs). With regard to Bayesian assessment of model fit, a robust estimate of the standard error of T(yobs) permits the specification of a (frequentist) confidence interval within which values of T(yr) may be regarded as being consistent with T(yobs) and beyond which values of T(yr) are likely to be discordant with T(yobs), that is, biased due to model misspecification or lack of fit. Comparing values of T(yr) against a bootstrap “minimal bias” confidence interval for T(yobs), rather than against a single point estimate, robustifies the assessment of model fit against the bias and imprecision inherent in T(yobs) due to the small sample size.
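The bootstrap step may be carried out with PROC SURVEYSELECT, as sketched below. The input data set obs, the analysis variable y, the seed, and the choice of the sample mean as T(yb) are illustrative assumptions; METHOD=URS draws with replacement, SAMPRATE=1 makes each resample the same size as the original sample, and OUTHITS writes repeatedly selected observations as separate records.

proc surveyselect data=obs out=boot seed=20140301
      method=urs samprate=1 reps=500 outhits;
run;

/* T(yb): the sample mean within each bootstrap replicate */
proc means data=boot noprint;
   by replicate;
   var y;
   output out=bootT mean=T_b;
run;

/* muB and sigmaB: mean and standard deviation of the B bootstrap statistics */
proc means data=bootT mean std;
   var T_b;
run;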

To specify a confidence interval defining a region of “minimal bias” surrounding T(yobs), a useful convention set forth by Schafer and colleagues holds that bias in a statistical estimate becomes appreciable when it exceeds 50% of one standard error of the estimate (Collins, Schafer, & Kam, 2001; Schafer & Kang, 2008). Using this guideline, a confidence interval within which values of T(yr) may be regarded as minimally biased with regard to the observed data may be specified as

I: (μB – 0.5∙σB) ≤ T(yr) ≤ (μB + 0.5∙σB). (S1)

The expression in Equation S1 states that values of T(yr) between –0.5 and +0.5 estimated standard errors from the estimated mean of the sampling distribution of T(yobs) are contained in the minimal bias interval. Furthermore, a coverage probability Pc for I may be estimated as the proportion of values of T(yr) that are contained within I. A good-fitting model from a Bayesian perspective would be expected to produce estimates of Pc close to 1.0 for each variable in a given statistical model, indicating that nearly all of the posterior predicted values are contained in the minimal bias interval I. Values of Pc substantially lower than 1.0 (i.e., less than 0.90) would indicate that a sizable proportion of posterior predicted values are not contained in I, suggesting the model does not fit the data well for a given variable.
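The estimate of Pc may be obtained with a DATA step analogous to the ppp computation, as sketched below. It assumes, hypothetically, that the bootstrap mean and standard error have been stored in macro variables muB and sigmaB (e.g., via CALL SYMPUTX in a prior DATA step) and that the data set ppd_stats again holds the replicated statistics T(yr) in the variable t_rep.

data coverage;
   set ppd_stats end=eof;
   /* count replications falling inside the minimal bias interval I (Equation S1) */
   if (&muB - 0.5*&sigmaB) <= t_rep <= (&muB + 0.5*&sigmaB) then n_in + 1;
   if eof then do;
      Pc = n_in / _n_;                 /* proportion of T(yr) contained in I */
      put "Estimated coverage probability Pc = " Pc;
      output;
   end;
run;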

In the current demonstration analysis, T(yobs) was chosen to be the sample mean for each variable in the SEM analysis and T(yr) (r = 1, …, R) was a corresponding set of predicted means based on R = 500 simulated samples from the PPD, which were generated by the PROC MCMC program (see line 89 of the MCMC code presented in Figure S1). The decision to set R = 500 was informed by Gelman et al.’s (2004, p. 164) demonstration of the PPD for Bayesian model checking in which 200 PPD samples were simulated, with justification that “we use only 200 draws (from the PPD) … to illustrate that a small simulation gives adequate inference for many practical purposes” (p. 144). In the current demonstration analysis, setting R = 500 was well in excess of Gelman et al.’s (2004) specification, thereby providing enhanced assurance that the PPD contained sufficient information to evaluate model fit. Increasing R beyond 500, however, may lead to marked increases in the computer processing time for the MCMC procedure and therefore is not recommended.