Supplementary materials

Title: A unified GMDR method for detecting gene-gene interactions in family and unrelated samples with application to nicotine dependence

Authors: Guo-Bo Chen1, Nianjun Liu1, Yann C. Klimentidis1, Xiaofeng Zhu2, Degui Zhi1,Xujing Wang3, Xiang-Yang Lou1,*

Affiliations:

1Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA;

2Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA;

3Department of Physics and the Comprehensive Diabetes Center, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA

* Correspondence to:

Xiang-Yang Lou, Ph.D.

1665 University Boulevard, RPHB 327

Birmingham, Alabama 35294-0022

Phone: 205-975-9145

Fax: 205-975-2541

E-mail:

1

Supplementary materials

Materials and Methods

Principal Components Analysis for Correction of Population Structure in Family and Unrelated Samples

Assume we have a dataset that consists of both families and unrelated individuals. For the purpose of illustrating the outline of our method, we use nuclear families only, but its application to extended families is straightforward in theory. Of these nuclear families, there are individuals, where is the number of all individuals in each nuclear family, including founders and children and with the first two (j =1 or 2) being the father and mother for ease of notation. In addition to these families, we have singletons (unrelated individuals) who have no relation to any other member. In total, we have individuals and nuclear families (for convenience of notation, we consider each unrelated individual as a nuclear family with only one member, i.e., when ). We assume the founders of nuclear families are unrelated and any two individuals from different families are not related either.

To prevent spurious association,the conventional strategy treats a nuclear family conditional on the founders, but it partially excludes the contribution from founders, and thus tends to reduce the statistical power that the data could otherwise have achieved. To tackle population stratification, which may exist among the founders and unrelated individuals, principal components analysis (PCA) can be employed to dissect the relative ancestry of individuals. As suggested by Zhu et al(2008), only the founders of the nuclear families and unrelated individuals serve as the reference population for inferring the ancestral composition of the whole sample. We use the PCA-based approach suggested by Price et al(2006) to determine the relative ancestry of each individual. Let = , the marker score for the individual in the family and its realized vector. The correlation matrix of the markers is , where is the overall mean of . For a total of unrelated individuals, there are up to the same number of eigenvalues. Here we only use the largest principal components. Each component is calculated as , where is the eigenvector which represents the corresponding orthogonal axis, or can be interpreted as genealogical relationship among haplotypes(McVean 2009). In our report, we used the top five principal components, which are expected to project the population structure (as demonstrated in Liu et al(2011)). Once the first principal components are calculated, they are also used to calculate the principal components for offspring.

Let be the phenotype for the individual in the nuclear family. A phenotype of interest can be modeled with the generalized linear model using the link function, . Without loss of generality, takes the form of identity for continuous traits, or logit for dichotomous traits. In the following illustration, we only considered continuous traits, but its application to dichotomous ones is straightforward(Lou et al. 2007).

Simulation Scenarios

Systematic simulations were deployed to investigate the power in various scenarios. These scenarios simulated a recent admixed population such as African-Americans. The panel of SNPs with large allele frequency differences across populations was obtained from Smith et al(2004). We used only the autosomal ancestry-informative markers. We specifically sampled 200 markers for which the difference in allele frequency between African and European parental populations is larger than 0.3. In order to generate the admixed population, we used a continuous gene-flow model(Zhu et al. 2006). We created a base population of 10,000 Africans. In every subsequent generation, 250 European ancestral individuals “migrated” into the African population. After inter-mating in each generation, the samples were recruited in the tenth generation, in which ~80% of the genetic contribution was from African origin and ~20% from European ancestry. This simulated genetic constitution of the admixed population is analogous to that of African Americans whose average degree of African ancestry is approximately 75%(Tishkoff et al. 2009). We simulated two independent risk factors for the disease, one modeled on specific genetic factors and the other based entirely on the proportion of one’s African ancestry. For individuals of complete African ancestry, the disease prevalence was 0.1, whereas it was 0.2 if they are of entire European ancestry. However, for each individual, the percentage of African ancestry varies, and we simulated the general risk of an individual is proportional to one’s genetic ancestry, , where ranged from 0 to 1 was indicated by one’s genetic proportion of African ancestry. The risk for disease for the ith individual can be expressed as , in which the first term is an offset determined by one’s degree of admixture and the second term indicates the genotypic risk given genotype for the causative loci. We used , and = 0.46, 0.8, 1.1 and 1.35, respectively, to make the relative risk 1.5, 2.0, 2.5, and 3.0. We did not assume other potential effects, such as familial effects and others, which may also influence the power and type I error rates, but integratingthe correction for those effects into the method is straightforward.

Among the various factors influencing power, study design, disease model, relative risk, and allele frequency, were investigated for their impacts on power in the methods compared here. For the study design, there were four sampling schemes: Design I sampled 200 cases, 200 controls and 200 nuclear families each having a discordant sibpair and two unaffected parents, a total of 1200 individuals; Design II sampled 200 cases, 200 controls, and 200 nuclear families with at least one affected in three children and at least one affected parent, a total of 1400 individuals; Design III sampled 500 cases, 500 controls, and 200 nuclear families with at least one affected in three children and at least one affected parent, a total of 2000 individuals; and Design IV sampled 200 cases, 200 controls, and 320 nuclear families with at least one affected in three children and at least one affected parent, a total of 2000 individuals. For the disease model, under the diallelic setting three digenic disease models were set: of the checkerboard model, the genotypes of 1112, 1211, 1222, and 2212 were set as high-risk ones; of the diagonal model, the genotypes of 1111, 1212, and 2222 were set as high-risk ones; and of the upper-corner model, the genotypes of 1111, 1112, 1211, 1212 were set as high-risk ones. For relative risk, we used four relative risks, which were 1.5, 2.0, 2.5, and 3.0. These three factors (sampling design, disease model, relative risk) generated 48 general combinations, each of which was further bound to random allele frequencies, which followed a uniformed distribution (0.05~0.95) so that the allele frequency is always greater than 0.05 for each locus of a total of 15 loci simulated. So the allele frequency was considered as a nested random factor in each of the 48 combinations and 25 replications of allele frequency were simulated within each of 48 combinations. In total, 1200 scenarios were examined. Each scenario was replicated 500 times, and the statistical power of each scenario was calculated as the proportion of times that the interaction between the pair of loci set was statistically significant in the 500 replicates. When evaluating type I error rate, as the relative risks is one, only four general scenarios under Designs I, II, III, and IV, were simulated.

As described in the method section, PCA was employed to explore the genetic background of the population. The variance-covariance matrix of the marker data was used for eigenvalue decomposition and the linear combination of the first 5 principal components were used to determine the composite , indicating an individual’s African ancestry. Then, was fitted in the linear model to rule out the effect attributed to population stratification. The residuals were used in the various GMDR methods compared in this study. As none of the reference methods compared here have analytical solutions to determine the cutoffs for a given significance level, to compare with these methods on the same basis, 1000 permutations were used to assess the threshold values at alpha = 0.05 and 0.01. Each method may have its own permutation scheme because of a different data structure. In order to control for familial variance in family-based studies, we shuffled the phenotypic values of the sibs in the same family but not a peer in another family. For the case-control and unrelated study, we assumed that every subject was exchangeable with any other, and for the unified method, the unrelated subjects, including the cases, controls, and the founders of each family, were considered as exchangeable.

SAGESample

Majority of SAGE are unrelated samples, in addition to a few families, including, after quality control, a total of 3897 individuals from three subsamples: the CollaborativeStudy on the Genetics of Alcoholism (COGA) (1,178 individuals), the Collaborative Study on theGenetics of Nicotine Dependence (COGEND) (1,427 individuals) andthe Family Study of CocaineDependence (FSCD) (1,292 individuals). Their addiction traits such as alcohol addiction and nicotine dependence were assessed in person by trained research assistants. Using Illumina Human 1M platform,1,069,796 SNP markers were genotyped for each participant. Self-reported ethnicities indicate that about 35% of the participants are black and 65% are white. As self-reported ethnicity often partially reflects one’s genetic ancestral origins, especially for populations that have complicated migration or admixture histories, to investigate the population mixture, the principal components analysis wasrun for the SAGE datain which both unrelated samples and relatives are involved.

Results

Demonstration of the PCA in identifying the ancestry of each individual

We first demonstrate the ability of PCA in identifying the ancestry of each individual. As described in the Methods section, the principal component axes were constructed based on the unrelated individuals, and then all the unrelated individuals and the children were projected to the axes to predict their genetic ancestry. As demonstrated (Supplementary Figures 1 and 2), the first two principal components, predicted the ancestry of each individual very well. Having randomly picked a replication in each of the four designs, we plotted how the genetic ancestry information was captured by PCA. Switching from red to blue corresponds to decrease of African ancestry, and a corresponding increase in European ancestry (Supplementary Figure 1). As a matter of fact, in this simulation study, the first principal component corresponds to ancestry closely, and thus one’s true ancestry is almost linear to the first principal component (Supplementary Figure 2), as it explained almost 95% of the variance in true ancestry in each of the designs adopted. Adding more principal components increased but only slightly. Without loss of generality, we used the top 5 principal components to correct for population stratification (using the top 10 PCs did not result in a difference in power).

Across all the four designs, the type I error rates were controlled well under both 0.05 and 0.01 significance levels (Supplementary Table 1), suggesting that the PCA-based methods and the unified method can control for population structure well. Supplementary Table 2 summarizes the means of the type I error rate over 25 scenarios of allele frequency that were sampled from the uniform distribution on an interval between 0.05 and 0.95. Regardless of the sampled allele frequencies, the type I error rates remained along the expected proportions.

The three factors, study design, relative risk, and disease model, considered in the simulations had significant impact on the power of UI. Supplementary Table 2 provides an overview of the powers of UI according to the factors investigated. We implemented an ANOVA of the power of UI to check whether the factors could perturb power in the range of parameters studied. We conducted the 25 rounds of allele frequency sampling to calculate residual errors, and identified relative risk, study design, and disease model significantly contributed to the change in UI power, with an adjusted of 0.86. Similar to UI, the other three methods also had similar trends.

Acknowledgements

Funding support for the Study of Addiction: Genetics and Environment (SAGE) was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422). SAGE is one of the genome-wide association studies funded as part of the Gene Environment Association Studies (GENEVA) under GEI. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVACoordinatingCenter (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for collection of datasets and samples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA; U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01 CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01HG004438), the National Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse, and the NIH contract "High throughput genotyping for studying the genetic contributions to human disease" (HHSN268200782096C). The datasets used for the analyses described in this manuscript were obtained from the database of Genotypes and Phenotypes (dbGaP) found at through dbGaP accession number phs000092.v1.p.

References

Liu N, Zhao H, Patki A, Limdi NA, Allison DB (2011) Controlling Population Structure in Human Genetic Association Studies with Samples of Unrelated Individuals. Stat Interface 4: 317-326.

Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD (2007) A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am J Hum Genet 80: 1125-37.

McVean G (2009) A genealogical interpretation of principal components analysis. PLoS Genet 5: e1000686. doi: 10.1371/journal.pgen.1000686

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904-9. doi: 10.1038/ng1847

Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska A, Kessing BD, Malasky MJ, Scafe C, Le E, De Jager PL, Mignault AA, Yi Z, De The G, Essex M, Sankale JL, Moore JH, Poku K, Phair JP, Goedert JJ, Vlahov D, Williams SM, Tishkoff SA, Winkler CA, De La Vega FM, Woodage T, Sninsky JJ, Hafler DA, Altshuler D, Gilbert DA, O'Brien SJ, Reich D (2004) A high-density admixture map for disease gene discovery in african americans. Am J Hum Genet 74: 1001-13. doi: S0002-9297(07)64364-X [pii] 10.1086/420856

Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, Froment A, Hirbo JB, Awomoyi AA, Bodo JM, Doumbo O, Ibrahim M, Juma AT, Kotze MJ, Lema G, Moore JH, Mortensen H, Nyambo TB, Omar SA, Powell K, Pretorius GS, Smith MW, Thera MA, Wambebe C, Weber JL, Williams SM (2009) The genetic structure and history of Africans and African Americans. Science 324: 1035-44. doi: 1172257 [pii] 10.1126/science.1172257

Zhu X, Li S, Cooper RS, Elston RC (2008) A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet 82: 352-65.

Zhu X, Zhang S, Tang H, Cooper R (2006) A classical likelihood based approach for admixture mapping using EM algorithm. Hum Genet 120: 431-45. doi: 10.1007/s00439-006-0224-z

1

Supplementary materials

Supplementary Table 1. Type I error rates for admixture population

Design I / Design II / Design III / Design IV
Alpha=0.05 / 0.047 / 0.051 / 0.051 / 0.054
Alpha=0.01 / 0.010 / 0.012 / 0.011 / 0.011

Notes: Each type I error rate is averaged over 25 scenarios in which allele frequencies were generated randomly.

1

Supplementary materials

Supplementary Table 2. Analysis of variance of the power of UI

Degrees of freedom / Sum of square / Mean squares / F statistic / p-value
Relative risk / 3 / 162.743 / 54.248 / 2367.126 / <10e-6
Design / 3 / 4.911 / 1.637 / 71.435 / <10e-6
Disease model / 2 / 5.522 / 2.761 / 120.467 / <10e-6
Allele frequency (residuals) / 1191 / 27.294 / 0.023

The residual term is estimated on considering genotype frequency as replications to the 48 combinations generated by relative risk (4 levels), design of experiments (4 levels), and disease models (3 levels).

Supplementary Table 3. Power comparison for admixture population at alpha=0.01

Design / I / II / III / IV
Methods / FAM / CC / UN / UI / FAM / CC / UN / UI / FAM / CC / UN / UI / FAM / CC / UN / UI
RR=1.5
Checkerboard / .001 / 0 / 0 / .003 / .001 / .001 / .002 / .007 / .001 / .002 / .006 / .010 / .002 / .001 / .002 / .007
Diagonal / 0 / .004 / .001 / .003 / .001 / 0 / .001 / .002 / 0 / .001 / .003 / .006 / .001 / 0 / .001 / .003
Upper corner / 0 / 0 / .002 / .001 / 0 / 0 / .001 / .003 / 0 / .001 / .002 / .003 / 0 / 0 / .001 / .005
RR=2.0
Checkerboard / .012 / .006 / .038 / .131 / .027 / .007 / .035 / .190 / .028 / .100 / .228 / .446 / .091 / .006 / .080 / .416
Diagonal / .007 / .006 / .030 / .089 / .010 / .004 / .024 / .092 / .010 / .061 / .134 / .253 / .046 / .006 / .045 / .241
Upper corner / .002 / .003 / .016 / .044 / .006 / .004 / .015 / .0572 / .007 / .050 / .111 / .193 / .016 / .004 / .030 / .132
RR=2.5
Checkerboard / .100 / .076 / .374 / .741 / .210 / .071 / .352 / .810 / .234 / .621 / .844 / .953 / .574 / .074 / .577 / .966
Diagonal / .031 / .044 / .236 / .483 / .118 / .050 / .226 / .573 / .137 / .515 / .752 / .913 / .331 / .047 / .389 / .744
Upper corner / .015 / .042 / .210 / .393 / .039 / .030 / .138 / .334 / .050 / .360 / .542 / .691 / .184 / .034 / .293 / .653
RR=3.0
Checkerboard / .277 / .277 / .805 / .966 / .588 / .283 / .814 / .990 / .615 / .949 / .993 / .999 / .922 / .293 / .950 / .999
Diagonal / .137 / .211 / .677 / .900 / .450 / .226 / .721 / .971 / .395 / .851 / .28 / .964 / .573 / .168 / .691 / .924
Upper corner / .062 / .174 / .573 / .757 / .222 / .146 / .507 / .777 / .194 / .588 / .669 / .746 / .439 / .130 / .634 / .869
Overall power / .054 / .070 / .196 / .376 / .139 / .068 / .236 / .401 / .139 / .342 / .380 / .515 / .265 / .064 / .308 / .497

1

Supplementary materials

Supplementary Table 4. Power comparison between the meta-analysis for siblings and unrelated samples, and the proposed unified method

Model / Design I / Design II / Design III / Design IV
0.05 / 0.01 / 0.05 / 0.01 / 0.05 / 0.01 / 0.05 / 0.01
Checkerboard, RR=1.5 / Meta / 0.012 / 0.007 / 0.016 / 0.007 / 0.029 / 0.0136 / 0.035 / 0.013
UI / 0.015 / 0.008 / 0.022 / 0.008 / 0.050 / 0.0198 / 0.054 / 0.019
Diagonal, RR=2.0 / Meta / 0.095 / 0.042 / 0.17 / 0.079 / 0.345 / 0.209 / 0.327 / 0.185
UI / 0.145 / 0.064 / 0.27 / 0.130 / 0.490 / 0.322 / 0.436 / 0.261
Uppercorner, RR=2.5 / Meta / 0.222 / 0.125 / 0.301 / 0.201 / 0.736 / 0.642 / 0.581 / 0.453
UI / 0.330 / 0.198 / 0.370 / 0.268 / 0.811 / 0.722 / 0.672 / 0.541

Each model was further bound to random allele frequencies, which followed a uniformed distribution (0.05~0.95) of 15 loci simulated. Each model with each of 25 allele frequencies was simulated 200 times each. For the meta-analysis, the combining p-values were determinedby Fisher’s method in which the combined p-value has chi-square distribution with four degrees of freedom. P-values were calculated at alpha=0.05 and alpha=0.01 after Bonferroni correction for 105 digenic interactions given 15 loci.

1

Supplementary materials

Supplementary Table 5. Genotypic frequency distribution for the identified tetragenic model

rs1013402 / rs1044394 / rs2072660 / rs6559840 / Affected / Unaffected / Total
CC / CC / CC / CC / 22 / 46 / 68
CC / CC / CC / TC / 5 / 2 / 7
CC / CC / CT / CC / 155 / 168 / 323
CC / CC / CT / TC / 26 / 32 / 58
CC / CC / CT / TT / 1 / 3 / 4
CC / CC / TT / CC / 184 / 199 / 383
CC / CC / TT / TC / 42 / 48 / 90
CC / CC / TT / TT / 5 / 8 / 13
CC / TC / CC / CC / 25 / 27 / 52
CC / TC / CC / TC / 3 / 1 / 4
CC / TC / CC / TT / 0 / 1 / 1
CC / TC / CT / CC / 114 / 137 / 251
CC / TC / CT / TC / 16 / 26 / 42
CC / TC / CT / TT / 10 / 4 / 14
CC / TC / TT / CC / 142 / 190 / 332
CC / TC / TT / TC / 52 / 38 / 90
CC / TC / TT / TT / 9 / 7 / 16
CC / TT / CC / CC / 6 / 3 / 9
CC / TT / CC / TC / 0 / 1 / 1
CC / TT / CT / CC / 33 / 21 / 54
CC / TT / CT / TC / 9 / 8 / 17
CC / TT / CT / TT / 1 / 2 / 3
CC / TT / TT / CC / 35 / 43 / 78
CC / TT / TT / TC / 18 / 19 / 37
CC / TT / TT / TT / 2 / 4 / 6
TC / CC / CC / CC / 15 / 25 / 40
TC / CC / CC / TC / 5 / 1 / 6
TC / CC / CC / TT / 1 / 0 / 1
TC / CC / CT / CC / 95 / 108 / 203
TC / CC / CT / TC / 17 / 22 / 39
TC / CC / CT / TT / 4 / 0 / 4
TC / CC / TT / CC / 137 / 124 / 261
TC / CC / TT / TC / 42 / 59 / 101
TC / CC / TT / TT / 9 / 6 / 15
TC / TC / CC / CC / 11 / 15 / 26
TC / TC / CC / TC / 6 / 2 / 8
TC / TC / CT / CC / 74 / 79 / 153
TC / TC / CT / TC / 24 / 35 / 59
TC / TC / CT / TT / 7 / 2 / 9
TC / TC / TT / CC / 106 / 128 / 234
TC / TC / TT / TC / 56 / 69 / 125
TC / TC / TT / TT / 14 / 13 / 27
TC / TT / CC / CC / 2 / 1 / 3
TC / TT / CC / TC / 0 / 1 / 1
TC / TT / CC / TT / 0 / 1 / 1
TC / TT / CT / CC / 14 / 24 / 38
TC / TT / CT / TC / 7 / 8 / 15
TC / TT / CT / TT / 1 / 0 / 1
TC / TT / TT / CC / 24 / 31 / 55
TC / TT / TT / TC / 12 / 25 / 37
TC / TT / TT / TT / 2 / 4 / 6
TT / CC / CC / CC / 1 / 4 / 5
TT / CC / CC / TC / 1 / 1 / 2
TT / CC / CC / TT / 1 / 0 / 1
TT / CC / CT / CC / 15 / 24 / 39
TT / CC / CT / TC / 2 / 8 / 10
TT / CC / CT / TT / 0 / 2 / 2
TT / CC / TT / CC / 31 / 27 / 58
TT / CC / TT / TC / 11 / 15 / 26
TT / CC / TT / TT / 2 / 4 / 6
TT / TC / CC / CC / 3 / 3 / 6
TT / TC / CC / TC / 2 / 0 / 2
TT / TC / CC / TT / 1 / 0 / 1
TT / TC / CT / CC / 9 / 21 / 30
TT / TC / CT / TC / 8 / 11 / 19
TT / TC / CT / TT / 2 / 0 / 2
TT / TC / TT / CC / 27 / 37 / 64
TT / TC / TT / TC / 14 / 20 / 34
TT / TC / TT / TT / 7 / 4 / 11
TT / TT / CC / CC / 1 / 0 / 1
TT / TT / CC / TT / 1 / 0 / 1
TT / TT / CT / CC / 7 / 2 / 9
TT / TT / CT / TC / 1 / 1 / 2
TT / TT / CT / TT / 1 / 0 / 1
TT / TT / TT / CC / 6 / 11 / 17
TT / TT / TT / TC / 6 / 6 / 12
TT / TT / TT / TT / 1 / 3 / 4

1