Supplementary Text 1. Comparison of Statistical Models for Genome-Wide Association Scans

The Akaike and Schwarz information coefficients (AIC and SIC) are measures of the goodness of fit of an estimated statistical model and are often used to assess the appropriateness of random (and covariance) models in REML. Given a data set, several competing models may be ranked according to their AIC and SIC values, with the one having the lowest AIC or SIC being the best. Both are calculated from the deviance by Genstat (Payne et al. 2010) as follows:

AIC = deviance + 2 × r

SIC = deviance + log(n - p) × r

where n is the total number of useable units in the analysis, r is the number of parameters fitted in the random model (and any covariance models), and p is the number of parameters fitted in the fixed model.
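As a minimal sketch of the two formulas above, the following Python function computes both coefficients from a deviance. It assumes that "log" denotes the natural logarithm; the function name and the input values are hypothetical and are not taken from the analysis.

```python
import math


def information_coefficients(deviance, n, p, r):
    """Return (AIC, SIC) computed from the deviance.

    n: total number of usable units in the analysis
    p: number of parameters fitted in the fixed model
    r: number of parameters fitted in the random (and covariance) models
    Assumption: 'log' in the SIC formula is the natural logarithm.
    """
    if n - p <= 0:
        raise ValueError("n - p must be positive")
    aic = deviance + 2 * r
    sic = deviance + math.log(n - p) * r
    return aic, sic


# Hypothetical inputs, for illustration only.
aic, sic = information_coefficients(deviance=5000.0, n=500, p=2, r=3)
print(f"AIC = {aic:.1f}, SIC = {sic:.2f}")
```

Because log(n - p) exceeds 2 whenever n - p is larger than about 7.4, the SIC penalises each additional random parameter more heavily than the AIC for any realistically sized data set.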

Because of the strong population stratification present in our sample, we used the AIC and SIC coefficients as computed by Genstat (Payne et al. 2010) to check whether adding extra structure covariates, as proposed by Yu et al. (2006), improved the goodness of fit, and whether the positive associations remained significant after adding them. Two distinct statistical mixed models were assessed for goodness of fit: (1) For the ‘K model’, a relative kinship matrix (K) based on simple matching coefficients was derived from a set of random SNP data using Genstat software (Payne et al. 2010). Markers were fitted as fixed effects, and genotype was fitted as a random effect, assumed to be distributed as N(0, 2Kσ_g^2), where K is the kinship matrix and σ_g^2 is the genetic variance. (2) For the ‘K + Q model’, the STRUCTURE output matrix (Q) of probabilities of belonging to each of k groups when k = 6 (Comadran et al. 2009) was added directly as a co-factor in the random term of the ‘K model’, as described in the literature (Yu et al. 2006). Both models were tested for yield and three yield-related traits (grains per spike, thousand kernel weight and heading date) that may be heavily affected by the germplasm stratification.
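To illustrate what a kinship matrix based on simple matching coefficients looks like, the sketch below computes the proportion of markers at which two lines carry the same allele. It is a toy illustration and not the Genstat procedure used here: it assumes biallelic markers coded 0/1 for inbred lines with no missing calls, and the function name and genotype data are hypothetical.

```python
def simple_matching_kinship(genotypes):
    """Return an n x n matrix of simple matching coefficients.

    genotypes: list of n equal-length lists of marker calls (one list per line).
    Entry [i][j] is the fraction of markers at which lines i and j match.
    Assumption: no missing data; every marker is compared.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matches = sum(1 for a, b in zip(genotypes[i], genotypes[j]) if a == b)
            kinship[i][j] = kinship[j][i] = matches / m
    return kinship


# Hypothetical genotypes: three lines scored at four SNP markers.
lines = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]
K = simple_matching_kinship(lines)
for row in K:
    print(row)
```

The resulting matrix is symmetric with ones on the diagonal, which is the form expected of K in the random term described above.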

For yield, grains per spike, thousand kernel weight and heading date, adding Q as an extra covariate structure did not improve the goodness of fit (Table 1). Both statistical models yielded similar AIC and SIC coefficients, suggesting that adding the extra Q covariate does not improve the model. Based on these results we decided to adopt the simpler of the two models (the ‘K model’).

Trait - model / Akaike information coefficient / Schwarz information coefficient
Yield - K / 5695 / 5833
Yield - K + Q / 5697 / 5841
TKW - K / 21383 / 21535
TKW - K + Q / 21385 / 21544
Heading date - K / 17257 / 17381
Heading date - K + Q / 17253 / 17382
Grains per spike - K / 16363 / 16471
Grains per spike - K + Q / 16366 / 16467

Table 1. Average goodness of fit for yield, thousand kernel weight (TKW), heading date and grains per spike under the ‘K model’ and the ‘K + Q model’ for 1307 SNP markers.
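The comparison in Table 1 can be made explicit by subtracting the ‘K model’ coefficients from the ‘K + Q model’ coefficients for each trait. The short sketch below does this with the values transcribed from Table 1; the dictionary layout and function name are illustrative choices only. A positive difference means that the ‘K + Q model’ has the higher (worse) coefficient.

```python
# (AIC, SIC) pairs transcribed from Table 1.
table1 = {
    "Yield": {"K": (5695, 5833), "K + Q": (5697, 5841)},
    "TKW": {"K": (21383, 21535), "K + Q": (21385, 21544)},
    "Heading date": {"K": (17257, 17381), "K + Q": (17253, 17382)},
    "Grains per spike": {"K": (16363, 16471), "K + Q": (16366, 16467)},
}


def coefficient_differences(table):
    """Return {trait: (delta_AIC, delta_SIC)} as 'K + Q' minus 'K'."""
    deltas = {}
    for trait, models in table.items():
        aic_k, sic_k = models["K"]
        aic_kq, sic_kq = models["K + Q"]
        deltas[trait] = (aic_kq - aic_k, sic_kq - sic_k)
    return deltas


deltas = coefficient_differences(table1)
for trait, (d_aic, d_sic) in deltas.items():
    print(f"{trait}: delta AIC = {d_aic:+d}, delta SIC = {d_sic:+d}")
```

All differences are within a few units of zero on coefficients in the thousands, which is what "similar AIC and SIC coefficients" refers to.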

We then explored in detail the changes in statistical significance of SNP markers under each model (Fig. 1). In the case of yield, AIC and SIC values for the ‘K + Q model’ were larger than those observed using the ‘K model’ for all but 3 of the 1307 SNP markers tested; these 3 markers had marginally higher AIC coefficients for the ‘K model’ than for the ‘K + Q model’. The three markers affected were SNPs 11_10191, 11_10194, and 11_10243, which co-segregate on chromosome 2H at 63.5 cM. The significance changes have no practical consequences, as the three markers were already significant for yield in the ‘K model’, although the ‘K + Q model’ resulted in slight increases in –log10[fp values] from 3.15, 3.0, and 3.0 to 4.1, 3.9 and 3.9, respectively.
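To put the reported changes on the probability scale, the sketch below converts the –log10 values quoted above back to probabilities. It assumes that –log10[fp value] is the negative base-10 logarithm of the F-test probability; the helper name is hypothetical.

```python
def to_probability(neg_log10_value):
    """Convert a -log10 score back to a probability (assumed F-test p-value)."""
    return 10 ** (-neg_log10_value)


# -log10 values for the three yield markers, as quoted in the text.
k_model = [3.15, 3.0, 3.0]
kq_model = [4.1, 3.9, 3.9]

for before, after in zip(k_model, kq_model):
    p_before = to_probability(before)
    p_after = to_probability(after)
    print(f"{before} -> {after}: p {p_before:.2e} -> {p_after:.2e} "
          f"({p_before / p_after:.1f}-fold smaller)")
```

An increase of roughly one unit on the –log10 scale corresponds to a probability about one order of magnitude smaller.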

Figure 1. Relationship between –log10[fp value] for the ‘K model’ and the ‘K + Q model’ for yield (A), heading date (B), thousand kernel weight (C) and grains per spike (D).

In the case of heading date, AIC values for the ‘K + Q model’ were marginally lower than those observed using the ‘K model’. However, the ‘K + Q model’ did not produce significant differences apart from an extra QTL (SNP 11_20394) on chromosome 2H at 33.4 cM with a –log10[fp value] of 3.65. The rice homolog of the positive SNP marker (Os07g49220) is located 24 gene models away from Os07g49460 (rice homolog to Arabidopsis PRR3), a candidate gene for HvPpd-H1 (Higgins et al. 2010), suggesting that, even though we believed the germplasm was not segregating for HvPpd-H1 variation, there may still be extra genetic variation around the locus. All the markers already significant for heading date using the ‘K model’ remained significant with the ‘K + Q model’.

In the case of thousand kernel weight and grains per spike, adding Q as an extra covariate structure did not improve the model (in terms of AIC and SIC values), nor did it prevent detection of the effects related to ‘2 row’ and ‘6 row’ ear types. All the markers already significant using the ‘K model’ remained significant with the ‘K + Q model’.

Literature Cited

Comadran, J., W. T. Thomas, F. A. Van Eeuwijk, S. Ceccarelli, S. Grando et al., 2009 Patterns of genetic diversity and linkage disequilibrium in a highly structured Hordeum vulgare association-mapping population for the Mediterranean basin. Theor. Appl. Genet. 119: 175-187.

Higgins, J. A., P. C. Bailey, and D. A. Laurie, 2010 Comparative genomics of flowering time pathways using Brachypodium distachyon as a model for the temperate grasses. PLoS One 5: e10065.

Payne, R. W., D. A. Murray, S. A. Harding, D. B. Baird, and D. M. Soutar, 2010 An introduction to GenStat for Windows (13th Edition). VSN International, Hemel Hempstead, UK.

Yu, J., G. Pressoir, W. H. Briggs, I. Vroh Bi, M. Yamasaki et al., 2006 A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38: 203-208.