Supplementary Material for “Analysis of Family- and Population-Based Samples in Cohort Genome-Wide Association Studies”

Ani Manichaikul, Wei-Min Chen, Kayleen Williams, Quenna Wong, Michèle M. Sale, James S. Pankow, Michael Y. Tsai, Jerome I. Rotter, Stephen S. Rich, Josyf C. Mychaleckyj

Supplementary Methods

Software

Simulation of genotypes and quantitative traits according to the specified polygenic model (Chen and Abecasis 2007) was performed in Merlin (Abecasis et al. 2002), with additional manipulation of phenotypes in R (R Development Core Team 2010). We used PLINK software (Purcell et al. 2007) to perform QFAM-within, QFAM-total, DFAM, linear regression and Trend tests. We used GDT software (Chen et al. 2009) to perform the fastAssoc, robustAssoc, GDT, and MQLS analyses. FBAT (Laird et al. 2000) was employed using default settings in software of the same name. LME and GEE were run using the GWAF (Chen and Yang 2010) package, an R implementation designed specifically for family-based genome-wide association.

Meta-analyses combining family and unrelated samples were performed using the software package METAL (Willer et al. 2010), which combined p-values while taking into account the direction of effect and weighting family and unrelated samples by their total sample sizes.
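As an illustration of this type of combination, the following minimal Python sketch implements a sample-size weighted Z-score meta-analysis of the kind documented for METAL: each two-sided p-value is converted to a signed Z-score using the direction of effect, and Z-scores are combined with weights equal to the square root of each sample size. This is a sketch under stated assumptions rather than the METAL code itself; the function names and the example p-values and sample sizes are hypothetical and are not taken from our analyses.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def signed_z(p_value, effect_direction):
    """Convert a two-sided p-value (0 < p <= 1) and an effect sign into a signed Z-score."""
    z = _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    return z if effect_direction >= 0 else -z


def sample_size_weighted_meta(studies):
    """Combine studies given as (p_value, effect_direction, sample_size) tuples.

    Weights are sqrt(sample_size); returns (combined Z, two-sided p-value).
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p_value, effect_direction, sample_size in studies:
        weight = sqrt(sample_size)
        numerator += weight * signed_z(p_value, effect_direction)
        sum_sq_weights += weight * weight
    z_meta = numerator / sqrt(sum_sq_weights)
    p_meta = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z_meta)))
    return z_meta, p_meta


# Hypothetical inputs: one family-based result and one unrelated-sample result
# for the same SNP, with the effect in the same direction in both.
example_studies = [
    (0.03, +1, 2000),  # family sample (illustrative p-value and size)
    (0.01, +1, 6000),  # unrelated sample (illustrative p-value and size)
]
z_meta, p_meta = sample_size_weighted_meta(example_studies)
print(f"combined Z = {z_meta:.3f}, combined p = {p_meta:.3g}")
```

Because the weights depend only on sample size, this scheme does not require effect estimates to be on a common scale across the family-based and population-based analyses, which is convenient when the two samples are analyzed with different test statistics.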

Principal Components for Analysis of Lipid Phenotypes

Principal components of ancestry were calculated using an LD-thinned subset of the 906,000 SNPs. Because European ancestry typically accounts for 20% of the ancestry of African Americans recruited from large urban US centers, we first removed regions of known long-range linkage disequilibrium in Caucasians from consideration (Price et al. 2008). We then thinned for local LD among 2,630 Caucasian founders, also genotyped through MESA SHARe, using the PLINK (Purcell et al. 2007) option “--indep-pairwise” to create a subset of 110,557 genotyped SNPs, removing SNPs so that no pair within a 1,500 SNP window had R-squared greater than 0.2. For this subset of SNPs, we performed another round of LD thinning using the MESA African American samples, again setting R-squared of 0.2 as the upper bound for any two SNPs in the same 1,500 SNP window. We then applied EIGENSTRAT (Price et al. 2006) to compute principal components of ancestry using the resulting LD-thinned subset of 99,716 SNPs. Principal components were computed using an unrelated subset of African American samples, and then projected to family-based samples. After examining the scree plot for the top 20 principal components, and checking for symmetry in the distribution of loadings for each principal component, we determined that a single principal component of ancestry would provide sufficient adjustment for population stratification in our GWAS of African American samples.
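The step of computing principal components in unrelated samples and projecting them to family-based samples can be sketched as follows. This minimal Python example is not the EIGENSTRAT implementation: it standardizes each SNP by its mean and standard deviation in the unrelated reference sample (an assumption made for simplicity), obtains the leading SNP-loading vector by power iteration, and scores family members with those same loadings. All function names and the toy genotype matrices are hypothetical.

```python
import math
import random


def snp_means_and_sds(genotypes):
    """Per-SNP mean and population SD across samples (rows are samples)."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(n_snps)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in genotypes) / n)
        for j in range(n_snps)
    ]
    return means, sds


def standardize(genotypes, means, sds):
    """Center and scale each SNP using the supplied reference means and SDs."""
    return [[(g - m) / s for g, m, s in zip(row, means, sds)] for row in genotypes]


def top_pc_loadings(x, n_iter=200, seed=1):
    """Leading SNP-loading vector of x (samples x SNPs) by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(x[0]))]
    for _ in range(n_iter):
        u = [sum(a * b for a, b in zip(row, v)) for row in x]  # u = X v
        w = [sum(row[j] * ui for row, ui in zip(x, u)) for j in range(len(v))]  # w = X^T u
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def project(x, loadings):
    """Principal component score of each sample (row) for the given loadings."""
    return [sum(a * b for a, b in zip(row, loadings)) for row in x]


# Toy genotypes coded 0/1/2 (hypothetical): 4 unrelated samples, 3 polymorphic SNPs.
unrelated = [[0, 0, 2], [0, 1, 2], [2, 2, 0], [2, 1, 0]]
family = [[0, 0, 2], [2, 2, 0]]  # family members to be projected

means, sds = snp_means_and_sds(unrelated)
loadings = top_pc_loadings(standardize(unrelated, means, sds))
family_scores = project(standardize(family, means, sds), loadings)
print(family_scores)
```

Estimating the loadings from unrelated samples only keeps the leading components from being driven by within-family correlation; family members are then placed on the same axes without contributing to their definition. The sketch assumes no monomorphic SNPs in the reference sample, since those would have zero standard deviation.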

References

Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97-101

Chen MH, Yang Q (2010) GWAF: an R package for genome-wide association analyses with family data. Bioinformatics 26: 580-1

Chen WM, Abecasis GR (2007) Family-based association tests for genomewide association scans. Am J Hum Genet 81: 913-26

Chen WM, Manichaikul A, Rich SS (2009) A generalized family-based association test for dichotomous traits. Am J Hum Genet 85: 364-76

Clopper C, Pearson E (1934) The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404

Laird NM, Horvath S, Xu X (2000) Implementing a unified approach to family-based tests of association. Genet Epidemiol 19 Suppl 1: S36-42

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-75

R Development Core Team (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing

Willer CJ, Li Y, Abecasis GR (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26: 2190-1

Wu X, Kan D, Cooper RS, Zhu X (2005) Identifying genetic variation affecting a complex trait in simulated data: a comparison of meta-analysis with pooled data analysis. BMC Genet 6 Suppl 1: S97

Supplementary Table 1: Comparison of computational efficiency for methods of quantitative trait analysis and binary trait analysis included in the family-based simulations.

Analysis Type / Method / Computation Time
Quantitative / QFAM-within / 1 hr 53 min 31.02 sec
Quantitative / FBAT / 21.94 sec
Quantitative / QFAM-total / 17 min 4.79 sec
Quantitative / LME / 2 min 21.94 sec
Quantitative / fastAssoc / 3.04 sec
Quantitative / robustAssoc / 5.90 sec
Quantitative / Linear regression / 2.16 sec
Binary / GDT / 2.50 sec
Binary / DFAM / 2.43 sec
Binary / MQLS / 6.79 sec
Binary / GEE / 1 min 18.96 sec
Binary / Trend / 2.12 sec

Reported computation times are based on analysis of 1000 SNPs. Analysis by QFAM-within and QFAM-total includes permutation to 10,000 replicates using the PLINK option “--mperm”. All computation was performed on an Intel Xeon with a 2.50 GHz processor.

Supplementary Figure 1: Comparison of type I error rate and power for quantitative trait analysis, when the minor allele frequency (MAF) is 0.05.

(A) Type I error rate at significance level 0.01, (B) type I error rate at significance level 0.001, and (C) power in quantitative trait analysis of 687 multiplex families. (D) Type I error rate at significance level 0.01, (E) type I error rate at significance level 0.001, and (F) power in quantitative trait analysis of 687 multiplex families and 5,922 singletons, with results for analysis of 5,922 singletons alone shown for reference. Uncertainty in point estimates of type I error rates is depicted through 95% confidence intervals constructed by inverting an exact binomial test (Clopper and Pearson 1934).

Supplementary Figure 2: Comparison of type I error rate and power for quantitative trait analysis, when the minor allele frequency (MAF) is 0.01.

(A) Type I error rate at significance level 0.01, (B) type I error rate at significance level 0.001, and (C) power in quantitative trait analysis of 687 multiplex families. (D) Type I error rate at significance level 0.01, (E) type I error rate at significance level 0.001, and (F) power in quantitative trait analysis of 687 multiplex families and 5,922 singletons, with results for analysis of 5,922 singletons alone shown for reference. Uncertainty in point estimates of type I error rates is depicted through 95% confidence intervals constructed by inverting an exact binomial test (Clopper and Pearson 1934).

Supplementary Figure 3: Comparison of type I error rate and power for binary trait analysis, when the minor allele frequency (MAF) is 0.05.

(A) Type I error rate at significance level 0.01, (B) type I error rate at significance level 0.001, and (C) power in binary trait analysis of 687 multiplex families. (D) Type I error rate at significance level 0.01, (E) type I error rate at significance level 0.001, and (F) power in binary trait analysis of 687 multiplex families and 5,922 singletons, with results for analysis of 5,922 singletons alone shown for reference. Uncertainty in point estimates of type I error rates is depicted through 95% confidence intervals constructed by inverting an exact binomial test (Clopper and Pearson 1934).

Supplementary Figure 4: Comparison of type I error rate and power for binary trait analysis, when the minor allele frequency (MAF) is 0.01.

(A) Type I error rate at significance level 0.01, (B) type I error rate at significance level 0.001, and (C) power in binary trait analysis of 687 multiplex families. (D) Type I error rate at significance level 0.01, (E) type I error rate at significance level 0.001, and (F) power in binary trait analysis of 687 multiplex families and 5,922 singletons, with results for analysis of 5,922 singletons alone shown for reference. Uncertainty in point estimates of type I error rates is depicted through 95% confidence intervals constructed by inverting an exact binomial test (Clopper and Pearson 1934).
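The exact binomial confidence intervals referred to in the figure legends above can be sketched in a few lines of Python. The example below inverts the binomial tail probabilities by bisection to obtain Clopper-Pearson limits for an empirical type I error rate. It is a minimal illustration rather than the code used to produce the figures; the function names are hypothetical, and the counts in the usage example (number of rejections and number of simulation replicates) are illustrative values only.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(1.0, total)


def _solve_p(j, n, target, tol=1e-12):
    """Find p with P(X <= j | n, p) = target; the CDF is decreasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom_cdf(j, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, conf_level=0.95):
    """Exact (Clopper-Pearson) interval for a proportion with k successes in n trials."""
    alpha = 1.0 - conf_level
    # Lower limit solves P(X >= k | p) = alpha / 2; it is 0 when k = 0.
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1.0 - alpha / 2.0)
    # Upper limit solves P(X <= k | p) = alpha / 2; it is 1 when k = n.
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2.0)
    return lower, upper


# Illustrative counts (hypothetical): 12 rejections at level 0.01 in 1000 replicates.
low, high = clopper_pearson(12, 1000)
print(f"estimate = {12 / 1000:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

An interval that contains the nominal significance level (0.01 or 0.001) is consistent with a well-controlled type I error rate, whereas an interval lying entirely above the nominal level indicates inflation.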
