Methods: Development of a Multiple Regression Model

Appendix A.


To determine the driving factors of HIV treatment costs, multiple regression models were employed.

To specify relevant candidate predictors for the model, the number of observations and the number of missing observations were first calculated for each category of each potential predictor (Table 1 in the main text).

We found that the levels “NNRTI” and “Mixed” of the predictor “resistance” had small sample sizes. However, no medically justified way of combining these categories was available, so the predictor was included in the model without modification; the coefficient estimates for these categories must therefore be interpreted with caution.

The variance inflation factor (VIF) was used to assess potential multicollinearity among the predictors. Perfect collinearity occurred between the two variables that describe comorbidity, number of diseases and their severity, at the level “none.” To avoid the impact of this collinearity on the model, these variables were combined into one variable that describes comorbidity in terms of both severity and number of diseases. In the combined variable, severity refers to the most severe of a patient's diseases. With this modification, the VIF analysis showed acceptable results and all variables were considered as candidate predictors.
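The idea behind the VIF can be illustrated with a minimal sketch. It assumes the simplest possible case of two numeric predictors, where VIF = 1 / (1 - r^2) and r is their Pearson correlation; the study's predictors are categorical, so the analysis in R would have used a generalized form. The function names and the toy data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF of one predictor given a single other predictor: 1 / (1 - r^2)."""
    r2 = pearson_r(x, y) ** 2
    if 1.0 - r2 < 1e-12:      # perfect collinearity: VIF is unbounded
        return math.inf
    return 1.0 / (1.0 - r2)

# Hypothetical toy data, not taken from the study
n_diseases = [1, 2, 3, 4]
perfectly_collinear = [2, 4, 6, 8]    # exact linear function of n_diseases
uncorrelated = [1, -1, -1, 1]         # zero correlation with n_diseases

vif_collinear = vif_two_predictors(n_diseases, perfectly_collinear)
vif_uncorrelated = vif_two_predictors(n_diseases, uncorrelated)
```

With perfectly collinear inputs the VIF is infinite, which is why the two comorbidity variables had to be merged before fitting.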

Variable selection.

Thereafter, the statistical significance of each covariate was assessed using F-tests from a classical ANOVA of a linear model. For all analyses, statistical significance was determined at p < 0.05.

Further, a backward stepwise variable selection procedure was performed. The selection began with a saturated model that included all effects of interest as well as all first, second, and third-factor interactions between covariates. Variables and interactions were removed from the model stepwise, with each decision based on the Akaike information criterion (AIC) and residual deviance.

No interaction between predictors was statistically significant; therefore, a simple additive model was adopted. Predictors that were not statistically significant but added statistical power were retained in the model.

Additionally, we used regression subset selection: possible subsets of the pool of predictors were considered and compared using AIC.
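As a minimal sketch of how candidate models can be ranked by AIC, the following example uses the standard definition AIC = 2k - 2 ln L. The model names, log-likelihoods, and parameter counts are hypothetical and are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: name -> (log-likelihood, number of parameters)
candidates = {
    "full": (-5230.0, 30),
    "no_interactions": (-5235.5, 20),
    "reduced": (-5260.0, 12),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)   # candidate with the smallest AIC
```

In this toy setting the model without interactions wins: it loses little likelihood relative to the full model while using ten fewer parameters.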

When selecting candidate predictors, we combined the statistical methods described above with prior knowledge: some variables were predetermined for inclusion on the basis of previous research and theoretical interest in exploring the links between these variables and total costs. The resulting system was specified as an additive model with 14 explanatory variables: age group, gender, time since diagnosis of HIV group, CDC classification, therapy class, therapy line, drug resistance, viral load, CD4-T cell count, laboratory alanine aminotransferase test (Lab ALT), laboratory low-density lipoprotein cholesterol test (Lab LDL), laboratory serum creatinine level test (Lab Creat), comorbidity, and disability.

First, a classic linear model was employed to analyze the relationship between mean total costs and patient characteristics. The Breusch-Pagan test was applied to check formally for the presence of heteroscedasticity in the linear model. A significant result (p = 0.00404) rejected the null hypothesis of constant variance; therefore, the following models were used in the further analysis: an OLS regression of log-transformed costs and generalized linear models (GLMs) with a log link function and an exponential-family distribution of the error term.
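The logic of the Breusch-Pagan test can be sketched for the simplest case of one numeric predictor. The sketch assumes the studentized (Koenker) form, in which the statistic is n times the R^2 of an auxiliary regression of squared residuals on the predictor and is compared with a chi-square distribution with one degree of freedom. The data are synthetic and all names are hypothetical; this is not the study's computation.

```python
import math

def ols_fit(x, y):
    """Simple linear regression; returns (intercept, slope, residuals, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sst = sum((yi - my) ** 2 for yi in y)
    sse = sum(r * r for r in resid)
    r2 = 1.0 - sse / sst if sst > 0 else 0.0
    return intercept, slope, resid, r2

def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan statistic and p-value for one predictor."""
    _, _, resid, _ = ols_fit(x, y)
    squared = [r * r for r in resid]
    _, _, _, r2_aux = ols_fit(x, squared)   # auxiliary regression
    lm = len(x) * r2_aux
    p = math.erfc(math.sqrt(lm / 2.0))      # chi-square(1) upper tail
    return lm, p

# Synthetic data whose error spread grows with x (heteroscedastic by design)
x = list(range(1, 41))
y = [2.0 * xi + 0.5 * xi * (-1) ** xi for xi in x]

lm_stat, p_value = breusch_pagan_1d(x, y)
```

A small p-value, as in the synthetic case here, indicates that the residual variance depends on the predictor, which is the situation that motivated the move away from the plain linear model.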

Distributional characteristics of cost data.

To develop the regression model, the distributional characteristics of the total cost data were first investigated as follows: (i) a histogram of the total costs was plotted; (ii) skewness and kurtosis were computed; (iii) the Shapiro-Wilk test was used to investigate whether the log transformation normalized the data; (iv) quantile-quantile plots were used to compare the cost distribution with theoretical distributions (lognormal, gamma, and inverse Gaussian); and (v) the mean-variance relationship was plotted, with means and variances of the total costs computed within each level of each variable and the variance plotted against the mean [18]. R statistical software (version 3.1.1) was used.

For the selected patients (n = 1022), a histogram of the annual total costs is shown in Figure 2 (Appendix A). The histogram shows the distributional characteristics of the total cost data.

It can be seen that the distribution is skewed to the right, which is common for healthcare expenditure data. The skewness of the present data was 2.27 and the kurtosis was 10.66, indicating substantial positive skewness and heavy tails. Log transformation of the total cost data did not yield a normal distribution (Shapiro-Wilk test: W = 0.7963, p < 2.2e-16); however, skewness (0.86) and kurtosis (4.34) were reduced.
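The two shape measures can be sketched as follows. The example assumes the moment-based definitions, with kurtosis reported on the non-excess scale (a normal distribution has kurtosis 3); the text does not state which estimator was used in R. The cost values are hypothetical.

```python
import math

def central_moment(data, k):
    """k-th central moment with divisor n."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based (non-excess) kurtosis: m4 / m2^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

# Hypothetical right-skewed costs: each value doubles the previous one
costs = [1000.0 * 2 ** k for k in range(7)]
log_costs = [math.log(c) for c in costs]

skew_raw = skewness(costs)       # positive: long right tail
skew_log = skewness(log_costs)   # about zero: logs are evenly spaced
```

In this toy case the log transformation removes the skewness entirely; in the study it only reduced it, which is why further distributional checks were needed.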

To investigate the distributional characteristics of the total costs further, quantile-quantile (Q-Q) plots were built. Figure 3 (Appendix A) shows plots of the total costs against three selected theoretical distributions: a gamma distribution with shape parameter = 6.489 and scale parameter = 3465.14; an inverse Gaussian distribution with mean (mu) = 22485.97 and lambda = 145915.7, where Var = mu^3/lambda; and a lognormal distribution with parameters meanlog = 10.021 and sdlog = 0.337. According to the plots, total costs were best approximated by the inverse Gaussian distribution.
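As a rough plausibility check, the means and variances implied by the three fitted distributions can be computed from the reported parameters using the standard moment formulas. The function names below are hypothetical; only the parameter values come from the text.

```python
import math

# Parameter values as reported in the text
gamma_shape, gamma_scale = 6.489, 3465.14
ig_mu, ig_lambda = 22485.97, 145915.7
ln_meanlog, ln_sdlog = 10.021, 0.337

def gamma_moments(shape, scale):
    """Mean and variance of a gamma distribution."""
    return shape * scale, shape * scale ** 2

def inverse_gaussian_moments(mu, lam):
    """Mean and variance of an inverse Gaussian distribution."""
    return mu, mu ** 3 / lam

def lognormal_moments(meanlog, sdlog):
    """Mean and variance of a lognormal distribution."""
    mean = math.exp(meanlog + sdlog ** 2 / 2)
    var = (math.exp(sdlog ** 2) - 1) * math.exp(2 * meanlog + sdlog ** 2)
    return mean, var

gamma_mean, gamma_var = gamma_moments(gamma_shape, gamma_scale)
ig_mean, ig_var = inverse_gaussian_moments(ig_mu, ig_lambda)
ln_mean, ln_var = lognormal_moments(ln_meanlog, ln_sdlog)

gamma_sd, ig_sd, ln_sd = (math.sqrt(v) for v in (gamma_var, ig_var, ln_var))
```

The gamma and inverse Gaussian parameters imply nearly the same mean and a standard deviation of roughly 8,800 Euro each, so the two candidates differ mainly in tail shape rather than in their first two moments.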

Because the log transformation of the dependent variable did not normalize its distribution, the relationship between its mean and variance was analyzed. The mean and variance were calculated for each category of each variable. Figure 4 (Appendix A) illustrates the mean-variance relationship on the log scale. The line was fitted by weighted least squares, using the degrees of freedom associated with each variance as weights [19]. The resulting slope is approximately 2.94, supporting the initial preference for the inverse Gaussian family (variance proportional to mean^3).
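The slope estimate described above can be sketched as a weighted least squares fit of log variance on log mean. The group means, variances, and degrees of freedom below are hypothetical; the variances are constructed to be exactly proportional to the cubed mean, so the fitted slope should be 3.

```python
import math

def weighted_slope(x, y, w):
    """Slope of the weighted least squares line of y on x with weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Hypothetical per-category summaries (not the study's data)
group_means = [15000.0, 20000.0, 25000.0, 32000.0]
group_vars = [1e-6 * m ** 3 for m in group_means]   # variance = c * mean^3
group_df = [120, 340, 410, 150]                     # weights: n - 1 per group

slope = weighted_slope(
    [math.log(m) for m in group_means],
    [math.log(v) for v in group_vars],
    group_df,
)
```

A slope near 2 would point to the gamma family and a slope near 3 to the inverse Gaussian family, which is how the reported value of about 2.94 is read.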

Regression model.

The resulting model was specified as follows:

Notation: Age group: a = 1, 2, 3, 4; Gender: b = 1, 2; Time since diagnosis of HIV (HIVtime): c = 1, 2, 3; CDC classification: d = 1, 2, 3; Therapy class: h = 1, 2, 3, 4, 5; Therapy line: i = 1, 2, 3; Drug resistance: j = 1, 2, 3, 4, 5, 6, 7; Viral load: k = 1, 2, 3; CD4-T cell count: l = 1, 2, 3; Lab ALT: m = 1, 2; Lab LDL: n = 1, 2; Lab Creat: p = 1, 2, 3; Comorbidity: r = 1, 2, 3, 4, 5; Disability: v = 1, 2, 3.

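Let $\mu_{abcdhijklmnprv}$ denote the mean total costs for individuals of the $a$th age group and the $b$th gender who have clinical characteristics $c, d, h, i, j, k, l, m, n, p, r, v$ as denoted above. The model can then be expressed as:

$$\log(\mu_{abcdhijklmnprv}) = \beta_0 + \beta^{\mathrm{Age}}_a + \beta^{\mathrm{Gender}}_b + \beta^{\mathrm{Disab}}_v + \beta^{\mathrm{CDC}}_d + \beta^{\mathrm{Line}}_i + \beta^{\mathrm{ALT}}_m + \beta^{\mathrm{Creat}}_p + \beta^{\mathrm{LDL}}_n + \beta^{\mathrm{Comorb}}_r + \beta^{\mathrm{VL}}_k + \beta^{\mathrm{HIVtime}}_c + \beta^{\mathrm{CD4}}_l + \beta^{\mathrm{Res}}_j + \beta^{\mathrm{Class}}_h$$

where $\beta_0$ is a constant, $\beta^{\mathrm{Age}}_a$ is the effect due to the $a$th age group, $\beta^{\mathrm{Gender}}_b$ is the effect due to the $b$th gender group, $\beta^{\mathrm{Disab}}_v$ is the effect due to the $v$th disability degree, $\beta^{\mathrm{CDC}}_d$ is the effect due to the $d$th class according to the CDC classification for HIV infection, $\beta^{\mathrm{Line}}_i$ is the effect due to the $i$th therapy line, $\beta^{\mathrm{ALT}}_m$ is the effect due to the $m$th group of alanine aminotransferase test results, $\beta^{\mathrm{Creat}}_p$ is the effect due to the $p$th group of laboratory serum creatinine test results, $\beta^{\mathrm{LDL}}_n$ is the effect due to the $n$th group of laboratory low-density lipoprotein cholesterol test results, $\beta^{\mathrm{Comorb}}_r$ is the effect due to the $r$th number of diseases and severity index group, $\beta^{\mathrm{VL}}_k$ is the effect due to the $k$th viral load group, $\beta^{\mathrm{HIVtime}}_c$ is the effect due to the $c$th group of time (in years) after initial diagnosis of HIV infection before entering the survey, $\beta^{\mathrm{CD4}}_l$ is the effect due to the $l$th CD4-T cell count group, $\beta^{\mathrm{Res}}_j$ is the effect due to the $j$th genotypic antiretroviral resistance group, and $\beta^{\mathrm{Class}}_h$ is the effect due to the $h$th ARV class.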

Before fitting the GLMs, reference categories were specified for each predictor. Where previous literature or knowledge allowed a specific hypothesis, the corresponding planned contrasts were used in the model. Otherwise, the reference category of a predictor was defined as its factor level with the largest number of observations (see Table 1). All categorical variables were dummy coded so that each factor level was compared with the mean of the reference category.
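The default rule described above (most frequent level as reference, then dummy coding) can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data; in R the same effect is typically obtained by releveling a factor before fitting.

```python
from collections import Counter

def reference_level(values):
    """Most frequent level; ties go to the level seen first."""
    return Counter(values).most_common(1)[0][0]

def dummy_code(values, reference=None):
    """Return (reference, non-reference levels, 0/1 indicator rows)."""
    ref = reference if reference is not None else reference_level(values)
    levels = sorted(set(values) - {ref})
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return ref, levels, rows

# Hypothetical CD4 categories for six patients
cd4 = [">500", "200-500", "200-500", "<200", "200-500", ">500"]
ref, levels, rows = dummy_code(cd4)
```

Here `200-500` becomes the reference because it is the most frequent level; a patient in the reference category has all indicators equal to 0, so each fitted coefficient measures the difference from that category. Passing `reference=` explicitly corresponds to the case where a planned contrast overrides the default.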

The contrasts in the resulting model were specified as follows:

The following multiple regression models were applied and compared: (i) a linear regression on log-transformed data, (ii) a GLM with gamma family and log link function, and (iii) a GLM with inverse Gaussian family and log link function.

Model performance.

The adequacy of the models was assessed using goodness-of-fit measures, quantitative predictive indices, a plot of residuals, and a plot of predicted versus observed costs. Goodness of fit was appraised using $R^2$ for the linear model and McFadden’s pseudo-$R^2$ for each GLM, computed as:

$$R^2_{\mathrm{McFadden}} = 1 - \frac{\ln L_M}{\ln L_0}$$

where $\ln L_M$ is the log-likelihood of the fitted GLM and $\ln L_0$ is the log-likelihood of the model with just the constant term. Predictive ability was assessed using the mean absolute error (MAE), root mean squared error (RMSE), and bias:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}, \qquad \mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)$$

where $\hat{y}_i$ denotes the predicted mean of the total costs for patient $i$, $y_i$ denotes the observed costs for this patient, and $n$ is the number of patients. The obtained estimates are given in Table 9 (Appendix A).
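The performance measures named above can be sketched with their standard definitions. One point is an assumption: bias is taken here as the mean of predicted minus observed costs, so a negative value means under-prediction; the sign convention cannot be read unambiguously from the text. All names and numbers are illustrative.

```python
import math

def mae(pred, obs):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def bias(pred, obs):
    """Mean of (predicted - observed); sign convention is an assumption."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)

def mcfadden_pseudo_r2(loglik_model, loglik_null):
    """McFadden's pseudo-R^2: 1 - lnL(model) / lnL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical predictions and observations
predicted = [10.0, 20.0, 30.0]
observed = [12.0, 18.0, 33.0]

example_mae = mae(predicted, observed)
example_rmse = rmse(predicted, observed)
example_bias = bias(predicted, observed)
example_r2 = mcfadden_pseudo_r2(-50.0, -100.0)
```

MAE and RMSE measure the size of the errors, with RMSE penalizing large misses more heavily, while bias shows whether the errors cancel out on average.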

The quantitative predictive indices were computed on an independent data set. For the further analysis, the model with the inverse Gaussian family and log link function was preferred. Figure 5 (Appendix A) shows a plot of predicted versus observed costs for this model.

The value of McFadden’s pseudo-R2 suggested that roughly 50% of the variation in total costs could be explained by the selected patient characteristics. The resulting coefficients represent the percentage change in mean annual total costs in response to a one-unit shift in the explanatory variable, that is, a change from the reference category to the category in question. A Wald test was performed to test whether the difference between the coefficient of the reference category and that of each other category differs from zero. The p-values in Table 4 indicate whether each level’s mean differs significantly from the reference level’s mean.
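Reading a coefficient as a percentage change is an approximation that holds for small coefficients; under a log link the exact multiplicative effect on the mean is exp(beta). A minimal sketch, using a hypothetical coefficient value rather than an estimate from the study:

```python
import math

def percent_change(beta):
    """Exact percentage change in mean cost implied by a log-link coefficient."""
    return (math.exp(beta) - 1.0) * 100.0

# Hypothetical coefficient of 0.10 for some category versus its reference:
# read naively it is a 10% increase; the exact figure is slightly larger.
change = percent_change(0.10)
```

The gap between the naive and exact readings widens as the coefficient grows, so larger effects are better reported as ratios.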

Cost ratios.

The log link function was supported by the distribution of total costs, and the model showed a good fit, which made it possible to compute cost ratios between patients with different characteristics from the estimated coefficients. Following the reasoning of Blough et al. (2000) [18], ratios of mean total costs were calculated based on the following factors: ARV class, gender, CD4-T cell count, drug resistance, and comorbidity.

We constructed an analytical form of the cost ratio for a patient who differs from the reference patient only in these selected characteristics. Following the notation given in the section on the model specification, these variables have the following categories: therapy class: h = 1, 2, 3, 4, 5; CD4-T cell count: l = 1, 2, 3; comorbidity: r = 1, 2, 3 (with levels ≤2nonsev, >2nonsev, and >2severe, respectively); gender: b = 1, 2; drug resistance: j = 1, 2 (with levels “no resistance” and “at least three classes”, respectively).

Because the estimated coefficients are not on the original cost scale, it is reasonable to analyze the logarithm of the ratio between the total costs for the reference patient and the patient defined above. It is given by:

When coding the model, the coefficients of all reference categories were set to 0. Therefore, using the estimated coefficients, the cost ratio simplifies to:

Confidence intervals for the corresponding cost ratios were also computed using the variances and covariances of the parameter estimates [18]. For example, the following equation illustrates the calculation of the variance for the ratio of the mean total costs of male individuals with therapy class “PI-stand”, CD4 = “>500”, comorbidity = “≤2nonsev” and drug resistance = “no resistance” compared with the mean total costs of male individuals with therapy class “PI-stand”, CD4 = “200-500”, comorbidity = “>2nonsev” and drug resistance = “three classes”:

We obtained the result. Therefore, the 95% confidence interval for the logarithm of the ratio ranges from 0.20212 to 0.52176, with the corresponding confidence interval for the true ratio being (1.224, 1.685).
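The interval in this example can be retraced from Table 10. The two patient profiles differ only in CD4, comorbidity, and resistance, so the log ratio is a sum of three coefficients, and its variance is the sum of their variances plus twice each pairwise covariance. The sketch below copies the six relevant entries from Table 10. Two points are assumptions: the point estimate of the log ratio is not stated in the text and is taken as the midpoint of the reported interval, and a normal quantile of 1.959964 is used.

```python
import math

# Entries copied from Table 10 (variance-covariance matrix, partial)
var_cd4 = 6.392490e-04       # CD4, >500 vs. 200-500
var_com = 9.992591e-04       # Comorbidity, <=2nonsev vs. >2nonsev
var_res = 4.985138e-03       # Resistance, no resistance vs. three classes
cov_cd4_com = 5.468792e-05
cov_cd4_res = -5.999322e-05
cov_com_res = 2.022199e-05

# Variance of a sum of three correlated estimates
var_log_ratio = (var_cd4 + var_com + var_res
                 + 2 * (cov_cd4_com + cov_cd4_res + cov_com_res))
se_log_ratio = math.sqrt(var_log_ratio)

# ASSUMPTION: point estimate inferred as the midpoint of the reported interval
log_ratio = (0.20212 + 0.52176) / 2
z = 1.959964                 # two-sided 95% normal quantile

log_ci = (log_ratio - z * se_log_ratio, log_ratio + z * se_log_ratio)
ratio_ci = (math.exp(log_ci[0]), math.exp(log_ci[1]))
```

Under these assumptions the variance is about 0.00665, and exponentiating the limits gives an interval that agrees with the reported (1.224, 1.685) to three decimals.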

The cost ratios and respective confidence intervals are presented in Table 5 in the main text. The cell with a cost ratio of 1 indicates that all other ratios are estimated relative to these reference categories.

The part of the covariance matrix that is used for the calculation of the cost ratios is given in Table 10 (Appendix A).

Supplemental Digital Content. Figure 2. Histogram of annual total costs. The figure relates to Appendix A and displays the distributional characteristics of the annual total costs; the histogram shows that the data are skewed to the right.

Supplemental Digital Content. Figure 3. Quantile-quantile (Q-Q) plots of the total costs against theoretical distributions. The figure relates to Appendix A and further presents the distributional characteristics of the cost data, showing Q-Q plots of the total costs against three selected theoretical distributions: gamma, inverse Gaussian, and lognormal.

Supplemental Digital Content. Figure 4. Mean-variance relationship of the annual total costs. The figure relates to Appendix A and illustrates the mean-variance relationship of the total cost data on the log scale. It supports the choice of the inverse Gaussian distribution for the regression analysis.

Supplemental Digital Content. Figure 5. Observed versus predicted values. The figure relates to Appendix A and plots observed values against predicted values for the model used in the regression analysis (GLM with inverse Gaussian family and log link function).

Table 6. Description of the patients’ data for the patients who abandoned the survey during the first year of CORSAR (n=65).

Variable / Description / Categories / Percentage of observations, %
Age Group / Age group of a patient in years / 20-29 / 10.61
30-44 / 36.36
45-59 / 42.42
60+ / 10.61
n.a.* / 0.00
Gender / Gender / female / 7.58
male / 92.42
n.a. / 0.00
Education / The highest educational level achieved / graduated / 9.09
neither nor / 59.09
no school certificate / 1.52
n.a. / 30.30
Income / Stable or non-stable income / full-time employment / 33.33
pensioner / 18.18
other / 9.09
n.a. / 39.39
HIV related variables
Time since diagnosis of HIV / Time after initial diagnosis of HIV infection before entering the survey (in years) / 0-10 / 50.00
10-20 / 25.76
>20 / 13.64
n.a. / 10.61
CDC class / Class according to the CDC classification system for HIV infection / Category A: Mildly symptomatic / 19.70
Category B: Moderately symptomatic / 37.88
Category C: Severely symptomatic / 37.88
n.a. / 4.55
Viral Load / HIV viral load (RNA copies/ml) / <50 / 66.67
50-500 / 10.61
>500 / 7.85
n.a. / 15.15
CD4-T / CD4-T cell count (cells/mm3) / >500 / 43.94
200-500 / 43.94
<200 / 10.61
n.a. / 1.52
Treatment related variables
Therapy Class / Assigned antiretroviral drugs classes / PI-ind / 0.00
PI-standard / 1.52
NNRTI / 39.39
mixed / 0.00
other / 59.09
n.a. / 0.00
Therapy Line / Combination antiretroviral therapy (cART) line / first-line / 28.79
second- and third-line / 6.06
beyond the third-line / 21.21
n.a. / 43.94
Resistance / Genotypic resistance against antiretroviral medication / no resistance / 83.33
three classes (PI, NNRTI, NRTI) and more / 0.00
NNRTI / 3.03
NRTI / 3.03
NRTI and NNRTI / 0.00
PI / 7.58
PI and NRTI / 3.03
n.a. / 0.00
General Health related variables
Lab ALT / Alanine Aminotransferase Test (U/L) / <110 / 83.33
≥110 / 6.06
n.a. / 10.61
Lab LDL / Low-Density Lipoprotein Cholesterol Test (mg/dL) / <200 / 65.15
≥200 / 0.00
n.a. / 34.85
Lab CREAT / Serum creatinine level test (mg/dL) / < 0.9 / 42.42
0.9-1.5 / 42.42
>1.5 / 3.03
n.a. / 12.12
Comorbidity / Number of concomitant diseases and severity of the most severe disease / ≤2 non-severe / 36.36
≤2 severe / 7.58
>2 non-severe / 22.73
>2 severe / 7.58
none / 0.00
n.a. / 25.76
Disability / Disability index according to the German Disabled Persons Act** / 0 – No disability / 34.85
<50 – Intermediate/moderate disability / 9.09
≥50 – Severe disability in activities of daily living / 30.30
n.a. / 25.76

*n.a.: not available (missing observations)

**Grad der Behinderung (GdB), Deutsches Schwerbehindertenrecht

Table 7. Mean costs (SD) across the eight healthcare provider sites, stratified by cost category (Euro).

Site / Total* / cART drugs / non-ARV medication / Inpatient / Outpatient / Out-of-pocket / Indirect
1 / 23124.27 (6632.40) / 19094.70 (4544.05) / 1107.28 (3789.70) / 1769.19 (3448.88) / 397.50 (382.53) / 310.40 (913.69) / 979.56 (1701.80)
2 / 21402.75 (10648.49) / 18517.58 (5232.45) / 1264.03 (93776.49) / 1260.77 (4909.78) / 159.73 (368.58) / 223.98 (426.97) / 983.34 (1679.37)
3 / 22574.55 (7415.51) / 19089.18 (4328.43) / 1874.75 (4707.85) / 1094.61 (3269.59) / 179.84 (277.04) / 123.14 (255.74) / 735.38 (2140.92)
4 / 21653.70 (6975.45) / 18057.28 (4872.19) / 1263.07 (2789.76) / 1039.17 (2854.22) / 393.20 (601.07) / 381.03 (1115.55) / 1349.07 (2743.64)
5 / 22649.77 (8704.50) / 19530.42 (5392.66) / 1066.29 (1731.84) / 1426.45 (4759.66) / 151.62 (213.50) / 144.13 (267.04) / 876.81 (2404.60)
6 / 23518.91 (11967.16) / 18721.63 (6666.76) / 1831.75 (4845.97) / 1952.96 (5430.55) / 204.64 (318.63) / 178.31 (385.81) / 5377.92 (10767.95)
7 / 23053.41 (8886.56) / 19586.70 (7154.42) / 967.79 (2021.07) / 972.73 (2945.61) / 294.21 (302.01) / 242.24 (433.47) / 3175.06 (5776.34)
8 / 22284.53 (9363.10) / 18526.03 (6202.73) / 1790.15 (3214.76) / 919.52 (2602.03) / 244.70 (338.19) / 212.93 (470.58) / 1560.66 (4155.76)

*The estimates of the annualized total costs presented in Table 7 also include negligible cost fractions, e.g., massages, psychological support, and nutrition consulting.

Table 8. Data on the annualized costs for patients who completed both years of the CORSAR survey (n = 942).

Cost category / Mean costs (SD) (Euro), first year of CORSAR / Mean costs (SD) (Euro), second year of CORSAR
Total costs / 22477.57(8809.45) / 22231.03(8786.13)
cART drugs / 18852.53(5297.44) / 18688.62(5289.48)
non-ARV medication / 1499.36(3718.50) / 1805.05(5034.45)
Out-of-pocket / 212.23(588.61) / 200.87(605.36)
Indirect / 1462.79(3997.91) / 1779.37(4175.84)
Hospital stay / 1246.98(3850.15) / 984.53(2894.06)
Outpatient costs / 237.04(365.61) / 240.06(391.10)
Outpatient rehabilitation / 81.92(502.85) / 63.88(433.77)
Medical gymnastics / 55.33(168.31) / 56.45(170.99)
Massages / 14.34(40.50) / 14.66(40.46)
Nutrition support / 6.23(39.49) / 5.94(35.51)
Inpatient rehabilitation / 136.29(704.82) / 145.90(804.59)

Table 9. Model performance

Model / Predicted Mean (SD) / R2 or Pseudo-R2 / AIC / MAE / RMSE / Bias
GLM with inverse Gaussian distribution and log link function / 22379.42 (6051.59) / 0.5029296 / 10499 / 4500.22 / 6566.329 / 19.16293
GLM with Gamma distribution and log link function / 22360.16 (5984.769) / 0.5020352 / 10575 / 4076.246 / 6788.719 / -534.6144
OLS, log (naïve) / 22215.57 (5169.697) / 0.4345 / - / 4388.54 / 6895.422 / -1246.26

Table 10. Variance-covariance matrix of the parameter estimates (partial).

Coefficients / Gender, male vs. female / CD4, >500 vs. <200 / CD4, >500 vs. 200-500 / Comorbidity, ≤2nonsev vs. >2nonsev / Comorbidity, ≤2nonsev vs. >2severe / Therapy Class, PI-stand vs. Mixed / Therapy Class, PI-stand vs. NNRTI / Therapy Class, PI-stand vs. Other / Therapy Class, PI-stand vs. PI-ind / Resistance, no resistance vs. three classes
Gender, male vs. female / 1.351104e-03 / -5.517828e-05 / 3.617787e-05 / -3.681504e-05 / -1.369070e-04 / -5.676560e-05 / 9.074253e-05 / 7.205964e-05 / -1.763722e-05 / 3.806316e-06
CD4, >500 vs. <200 / -5.517828e-05 / 3.151038e-03 / 2.865989e-04 / 1.084944e-04 / -1.288990e-04 / -1.692336e-04 / 3.125375e-05 / -1.309653e-04 / -1.711770e-04 / 1.481077e-05
CD4, >500 vs. 200-500 / 3.617787e-05 / 2.865989e-04 / 6.392490e-04 / 5.468792e-05 / 7.468742e-05 / -4.670346e-05 / -5.079890e-05 / -1.708577e-05 / -7.397848e-05 / -5.999322e-05
Comorbidity, ≤2nonsev vs. >2nonsev / -3.681504e-05 / 1.084944e-04 / 5.468792e-05 / 9.992591e-04 / 4.848478e-04 / 4.680435e-05 / 2.297290e-05 / -5.675181e-05 / 8.897262e-05 / 2.022199e-05
Comorbidity, ≤2nonsev vs. >2severe / -1.369070e-04 / -1.288990e-04 / 7.468742e-05 / 4.848478e-04 / 3.914086e-03 / 1.428715e-04 / 8.343213e-05 / -7.255907e-05 / 2.348085e-04 / -1.401594e-04
Therapy Class, PI-stand vs. Mixed / -5.676560e-05 / -1.692336e-04 / -4.670346e-05 / 4.680435e-05 / 1.428715e-04 / 2.561858e-03 / 3.994939e-04 / 3.834428e-04 / 3.962514e-04 / -1.108174e-04
Therapy Class, PI-stand vs. NNRTI / 9.074253e-05 / 3.125375e-05 / -5.079890e-05 / 2.297290e-05 / 8.343213e-05 / 3.994939e-04 / 9.259000e-04 / 4.461606e-04 / 3.705730e-04 / 1.542004e-04
Therapy Class, PI-stand vs. Other / 7.205964e-05 / -1.309653e-04 / -1.708577e-05 / -5.675181e-05 / -7.255907e-05 / 3.834428e-04 / 4.461606e-04 / 9.888171e-04 / 4.040868e-04 / 1.148104e-04
Therapy Class, PI-stand vs. PI-ind / -1.763722e-05 / -1.711770e-04 / -7.397848e-05 / 8.897262e-05 / 2.348085e-04 / 3.962514e-04 / 3.705730e-04 / 4.040868e-04 / 5.097338e-03 / -8.778012e-04
Resistance, no resistance vs. three classes / 3.806316e-06 / 1.481077e-05 / -5.999322e-05 / 2.022199e-05 / -1.401594e-04 / -1.108174e-04 / 1.542004e-04 / 1.148104e-04 / -8.778012e-04 / 4.985138e-03