Supplementary Material

Simple Linear Regression for a Quantitative Trait

Consider a quantitative trait locus with two alleles A and a having frequencies $P_A$ and $P_a$, respectively. Denote the genotypic values of the genotypes AA, Aa and aa by $G_{11}$, $G_{12}$ and $G_{22}$, respectively. The genetic additive effect is given by (Appendix A)

$$\alpha = P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22}.$$

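Let $Y_i$ be the phenotypic value of the i-th individual. A simple linear regression model for a quantitative trait is given by (Appendix A)

$$Y_i = \mu + X_i \alpha + \varepsilon_i,$$

where

$$X_i = \begin{cases} 2P_a, & \text{AA} \\ P_a - P_A, & \text{Aa} \\ -2P_A, & \text{aa} \end{cases} \qquad (1)$$

and the $\varepsilon_i$ are independent and identically distributed normal variables with zero mean and variance $\sigma^2$.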

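Adding the constant $P_A - P_a$ to $X_i$ changes only the intercept of the model, not $\alpha$, and transforms the indicator variable to

$$X_i = \begin{cases} 1, & \text{AA} \\ 0, & \text{Aa} \\ -1, & \text{aa} \end{cases} \qquad (2)$$

which is a widely used indicator variable in quantitative genetic analysis.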
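That a constant shift of the indicator leaves the slope unchanged can be checked numerically. The sketch below is illustrative only: the genotype counts, the phenotypes and the allele frequency `p_A` are made-up values, and the shifted coding is assumed to be (1, 0, -1) for AA, Aa and aa.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Hypothetical data: number of A alleles carried by each of 10 individuals.
n_A = np.array([2, 2, 1, 1, 1, 0, 0, 2, 1, 0])
y = np.array([5.1, 4.7, 3.9, 4.2, 3.6, 2.8, 3.1, 5.4, 4.0, 2.5])
p_A = 0.6  # assumed allele frequency, used only to build the centered coding

x_centered = n_A - 2 * p_A   # 2*P_a, P_a - P_A, -2*P_A for AA, Aa, aa
x_shifted = n_A - 1          # 1, 0, -1 for AA, Aa, aa

mu_c, alpha_c = fit_line(x_centered, y)
mu_s, alpha_s = fit_line(x_shifted, y)
shift = 2 * p_A - 1          # the added constant, equal to P_A - P_a
```

The two fits return the same slope, and the intercepts differ by the slope times the added constant.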

Consider a marker locus with two alleles M and m having frequencies $P_M$ and $P_m$, respectively. Let D be the coefficient of linkage disequilibrium (LD) between the marker locus and the quantitative trait locus. The relationship between the phenotypic value and the genotype at the marker locus is given by

$$Y_i = \mu_M + X_i^M \alpha_M + \varepsilon_i, \qquad (3)$$

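where the indicator variable $X_i^M$ is given by

$$X_i^M = \begin{cases} 1, & \text{MM} \\ 0, & \text{Mm} \\ -1, & \text{mm} \end{cases} \qquad (4)$$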

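Assuming the genetic model for the quantitative trait in equation (1), it can be shown (Appendix B) that

$$\hat{\alpha}_M \xrightarrow{a.s.} \frac{D}{P_M P_m}\,\alpha,$$

which implies that at the trait locus, where $D = P_A P_a$, the estimator converges to $\alpha$ and hence is a consistent estimator.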

Assume that the marker is located at the genomic position t and the trait locus is at the genomic position s. Let $D(t,s)$ be the disequilibrium coefficient between the marker and trait loci. Let $\hat{\alpha}(t)$ be the estimated genetic additive effect at the marker locus and $\alpha(s)$ be the true additive effect at the trait locus. Let $P_M(t)$ and $P_m(t)$ be the frequencies of the marker alleles M and m at the genomic position t, respectively. Then $\hat{\alpha}(t)$ converges almost surely to

$$\frac{D(t,s)}{P_M(t)\,P_m(t)}\,\alpha(s). \qquad (5)$$
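The limit in equation (5) can be checked by computing the population regression slope directly from haplotype frequencies. The sketch below assumes random mating, a purely additive trait, and the (1, 0, -1) genotype coding at both loci; the function name and its arguments are illustrative and not part of the original analysis.

```python
import itertools

def marker_slope_limit(p_M, p_A, D, alpha, mu=0.0):
    """Population slope of the trait value regressed on the marker indicator."""
    # Haplotype frequencies for (marker allele, trait allele) with LD coefficient D.
    hap = {
        ("M", "A"): p_M * p_A + D,
        ("M", "a"): p_M * (1 - p_A) - D,
        ("m", "A"): (1 - p_M) * p_A - D,
        ("m", "a"): (1 - p_M) * (1 - p_A) + D,
    }
    assert all(f >= -1e-12 for f in hap.values()), "D is outside its admissible range"
    code = {"M": 1, "m": 0, "A": 1, "a": 0}
    ex = ey = exx = exy = 0.0
    # Random mating: a diplotype is an independent pair of haplotypes.
    for (h1, f1), (h2, f2) in itertools.product(hap.items(), repeat=2):
        w = f1 * f2
        x_marker = code[h1[0]] + code[h2[0]] - 1   # 1, 0, -1 for MM, Mm, mm
        x_trait = code[h1[1]] + code[h2[1]] - 1    # 1, 0, -1 for AA, Aa, aa
        y = mu + alpha * x_trait                   # expected phenotype, additive model
        ex += w * x_marker
        ey += w * y
        exx += w * x_marker ** 2
        exy += w * x_marker * y
    # Slope = Cov(X_marker, Y) / Var(X_marker)
    return (exy - ex * ey) / (exx - ex ** 2)
```

For any admissible inputs the returned slope equals `D * alpha / (p_M * (1 - p_M))`, and it equals `alpha` when the marker coincides with the trait locus, that is, when `p_M == p_A` and `D == p_A * (1 - p_A)`.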

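Suppose that there are K trait loci located at the genomic positions $s_1, \ldots, s_K$. The j-th trait locus has the additive effect $\alpha_j$. Then, we have

$$\hat{\alpha}(t) \xrightarrow{a.s.} \sum_{j=1}^{K} \frac{D(t,s_j)}{P_M(t)\,P_m(t)}\,\alpha_j, \qquad (6)$$

where $D(t,s_j)$ is the disequilibrium coefficient between the marker at the genomic position t and the trait locus at the genomic position $s_j$.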

Multiple Linear Regression for a Quantitative Trait

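Consider L marker loci located at the genomic positions $t_1, \ldots, t_L$. The multiple linear regression model for a quantitative trait is given by

$$Y_i = \mu + \sum_{l=1}^{L} X_i(t_l)\,\alpha(t_l) + \varepsilon_i, \qquad (7)$$

where the indicator variables $X_i(t_l)$ are defined as in equation (4). Let $P_{M_l}$ and $P_{m_l}$ be the frequencies of the alleles $M_l$ and $m_l$ of the marker located at the genomic position $t_l$, respectively, and let $D(t_l,t_k)$ be the coefficient of LD between the markers at the genomic positions $t_l$ and $t_k$. Assume that there are K trait loci as defined before. Let $D(t_l,s_j)$ be the coefficient of LD between the marker at the genomic position $t_l$ and the trait locus at the genomic position $s_j$. By an argument similar to that in Appendix B, we have

$$\hat{\alpha} \xrightarrow{a.s.} P^{-1} D\,\alpha, \qquad (8)$$

where $\hat{\alpha} = (\hat{\alpha}(t_1), \ldots, \hat{\alpha}(t_L))^T$, $\alpha = (\alpha_1, \ldots, \alpha_K)^T$,

$$P = \begin{pmatrix} P_{M_1}P_{m_1} & D(t_1,t_2) & \cdots & D(t_1,t_L) \\ D(t_2,t_1) & P_{M_2}P_{m_2} & \cdots & D(t_2,t_L) \\ \vdots & \vdots & \ddots & \vdots \\ D(t_L,t_1) & D(t_L,t_2) & \cdots & P_{M_L}P_{m_L} \end{pmatrix}, \quad D = \begin{pmatrix} D(t_1,s_1) & \cdots & D(t_1,s_K) \\ \vdots & \ddots & \vdots \\ D(t_L,s_1) & \cdots & D(t_L,s_K) \end{pmatrix}.$$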

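If we assume that all markers are in linkage equilibrium, then equation (8) is reduced to

$$\hat{\alpha}(t_l) \xrightarrow{a.s.} \sum_{j=1}^{K} \frac{D(t_l,s_j)}{P_{M_l} P_{m_l}}\,\alpha_j, \quad l = 1, \ldots, L,$$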

which is exactly the same as equation (6). In other words, under the assumption of linkage equilibrium among markers, multiple linear regression can be decomposed into a number of simple regressions.
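The decomposition under linkage equilibrium can be illustrated numerically. The sketch below assumes that the limit in equation (8) is the solution of a linear system whose matrix has $P_{M_l}P_{m_l}$ on the diagonal and the marker-marker LD coefficients off the diagonal, and whose right-hand side is the marker-trait LD matrix times the vector of trait effects. The function name, argument names and numerical values are hypothetical.

```python
import numpy as np

def marker_effect_limits(p_M, D_mm, D_ms, alpha):
    """Limits of the L marker coefficients in the multiple regression.

    p_M   : length-L frequencies of the marker alleles M_l
    D_mm  : L x L marker-marker LD coefficients (the diagonal is ignored)
    D_ms  : L x K marker-trait LD coefficients
    alpha : length-K additive effects of the trait loci
    """
    p_M = np.asarray(p_M, dtype=float)
    P = np.array(D_mm, dtype=float)            # copy, so the input is not modified
    np.fill_diagonal(P, p_M * (1.0 - p_M))     # P_M * P_m on the diagonal
    rhs = np.asarray(D_ms, dtype=float) @ np.asarray(alpha, dtype=float)
    return np.linalg.solve(P, rhs)

# Made-up example with two markers and two trait loci.
p = [0.3, 0.5]
D_ms = [[0.05, 0.02], [0.01, 0.04]]
alpha = [2.0, -1.0]
no_ld = marker_effect_limits(p, [[0, 0], [0, 0]], D_ms, alpha)
with_ld = marker_effect_limits(p, [[0, 0.03], [0.03, 0]], D_ms, alpha)
```

With all marker-marker LD coefficients set to zero, each entry of `no_ld` matches the single-marker limit in equation (6); with non-zero LD between the markers, the limits differ.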

Substituting the above equation into equation (7) and taking the limit leads to the functional linear model (10), which is discussed in the next section.

B-Splines

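Let the domain $[u_0, u_m]$ be subdivided into knot spans by a set of non-decreasing numbers $u_0 \le u_1 \le \cdots \le u_m$. The $u_i$ are called knots. The i-th B-spline basis function of degree p, written as $N_{i,p}(u)$, is defined recursively as follows (Ramsay and Silverman 2005):

$$N_{i,0}(u) = \begin{cases} 1, & u_i \le u < u_{i+1} \\ 0, & \text{otherwise} \end{cases}$$

$$N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i}\,N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}}\,N_{i+1,p-1}(u).$$

B-spline basis functions have two important features:

(1) The basis function $N_{i,p}(u)$ is non-zero only on the $p+1$ knot spans $[u_i, u_{i+1}), \ldots, [u_{i+p}, u_{i+p+1})$.

(2) On any knot span $[u_i, u_{i+1})$, there are at most $p+1$ basis functions of degree p that are non-zero, namely $N_{i-p,p}(u), \ldots, N_{i,p}(u)$.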
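The recursive definition and the two features can be illustrated with a direct implementation of the standard Cox-de Boor recursion. This is a minimal sketch: the function name and the uniform knot vector are illustrative choices, and a term whose denominator is zero (repeated knots) is taken to be zero, which is the usual convention.

```python
def bspline_basis(i, p, u, knots):
    """Value at u of the i-th B-spline basis function of degree p (Cox-de Boor)."""
    if p == 0:
        # Degree 0: indicator of the half-open knot span [u_i, u_{i+1}).
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:
        left = (u - knots[i]) / left_den * bspline_basis(i, p - 1, u, knots)
    if right_den != 0:
        right = (knots[i + p + 1] - u) / right_den * bspline_basis(i + 1, p - 1, u, knots)
    return left + right

# Uniform knots 0, 1, ..., 7 support four cubic basis functions N_0, ..., N_3.
knots = list(range(8))
values_at_3_5 = [bspline_basis(i, 3, 3.5, knots) for i in range(4)]
```

On the knot span [3, 4) all four cubic basis functions are non-zero and sum to one, while `bspline_basis(0, 3, 4.5, knots)` is zero because 4.5 lies outside the support [0, 4) of the first basis function.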

Methods for basis function selection

Basis function selection is carried out by LASSO regression. Consider a sequence of SNPs: with and , Let be a cubic B-spline. Assume that we have the observed data

. A regression model is given by

. (9)

We approximate by a B-spline:

. (10)

The regression spline estimate is obtained by minimizing

, (11)

where . Let , and

.

Then, equation (11) can be reduced to

(12)

A point that is a minimum of the objective function must satisfy [1]

The optimization problem (12) can be rewritten as

(13)

Differentiating equation (13) and setting the derivative to zero, we have

(14)

where , is the j-th column vector of the matrix M,

Thus,

or

.

Therefore,

Algorithms:

Step 1: initialization

Step 2: for

Step 3: for

,

End

If then

Go to step 2

Else

Stop (convergence)
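The iteration above follows the general pattern of coordinate descent for the LASSO [1]. The sketch below is a generic version, not necessarily the exact algorithm used here: it assumes the objective $\frac{1}{2}\|y - M\beta\|^2 + \lambda \sum_j |\beta_j|$, a design matrix `M` with no all-zero columns, and a stopping rule based on the largest change in any coefficient. The names `lasso_coordinate_descent`, `tol` and `max_iter` are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(M, y, lam, tol=1e-8, max_iter=1000):
    """Minimize 0.5 * ||y - M @ beta||^2 + lam * sum(|beta_j|) one coordinate at a time."""
    n, J = M.shape
    beta = np.zeros(J)                       # Step 1: initialization
    col_sq = (M ** 2).sum(axis=0)            # squared norm of each column of M
    for _ in range(max_iter):                # Step 2: outer sweeps
        beta_old = beta.copy()
        for j in range(J):                   # Step 3: update each coefficient in turn
            # Partial residual that excludes the contribution of column j.
            r_j = y - M @ beta + M[:, j] * beta[j]
            beta[j] = soft_threshold(M[:, j] @ r_j, lam) / col_sq[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break                            # convergence
    return beta
```

For an orthonormal design the result is the soft-thresholded least-squares solution; for example, with `M = np.eye(3)`, `y = [3.0, -0.5, 1.5]` and `lam = 1.0` the estimate is `[2.0, 0.0, 0.5]`, so the second basis function is dropped.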

Appendix A: A statistical model for genetic effects

The statistical model for three genotypic values can be expressed as

(A1)

where are the respective deviations of the genotypic values from their expectations on the basis of a perfect fit of the model. We then obtain estimators of and by minimizing

Setting , and , and solving these equations,

we obtain

.

The substitution effect is defined as

,

which is often termed the genetic additive effect.
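The least-squares derivation of the substitution effect can be reproduced numerically. The sketch below assumes the usual parameterization in which the genotypic values of AA, Aa and aa are modeled as $\mu + 2\alpha_1$, $\mu + \alpha_1 + \alpha_2$ and $\mu + 2\alpha_2$ plus a deviation, with the squared deviations weighted by the Hardy-Weinberg genotype frequencies. The function name and the example values are hypothetical.

```python
import numpy as np

def additive_effect_by_least_squares(p_A, G11, G12, G22):
    """Substitution effect alpha_1 - alpha_2 from a weighted least-squares fit."""
    p_a = 1.0 - p_A
    weights = np.array([p_A ** 2, 2 * p_A * p_a, p_a ** 2])  # genotype frequencies
    G = np.array([G11, G12, G22], dtype=float)
    # Columns: intercept mu, count of allele A (alpha_1), count of allele a (alpha_2).
    design = np.array([[1.0, 2.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [1.0, 0.0, 2.0]])
    w = np.sqrt(weights)
    # The design has rank 2, so lstsq returns the minimum-norm solution;
    # the difference alpha_1 - alpha_2 is the same for every solution.
    coef, *_ = np.linalg.lstsq(design * w[:, None], G * w, rcond=1e-10)
    mu, alpha_1, alpha_2 = coef
    return alpha_1 - alpha_2
```

For `p_A = 0.3` and genotypic values `(2.0, 1.5, 0.0)` the function returns 1.2, which agrees with the closed form $P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22} = 0.6 + 0.6 - 0$.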

Appendix B: The convergence of the least-squares estimator of the regression coefficients

Let , , , , , , . Then equation (3) can be written in a matrix form:

(B1)

The least-squares estimator of the regression coefficients is given by

(B2)

It follows from equation (4) that

Since

,

by the law of large numbers, we have

(B3)

It follows from equations (3) and (4) that

By the law of large numbers,

(B4)

Combining equations (B3) and (B4), we obtain

(B5)

When the marker is at the trait locus, we have

.
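The convergence result can also be examined by simulation. The sketch below is illustrative and rests on several assumptions that are not stated in the text: random mating, a single additive trait locus, the (1, 0, -1) genotype coding at both loci, and normally distributed residuals. All parameter values are made up.

```python
import numpy as np

def simulate_marker_estimate(n, p_M, p_A, D, alpha, sigma=1.0, seed=0):
    """Least-squares estimate of the marker effect from n simulated individuals."""
    rng = np.random.default_rng(seed)
    # Haplotype frequencies in the order MA, Ma, mA, ma.
    hap_freq = np.array([p_M * p_A + D,
                         p_M * (1 - p_A) - D,
                         (1 - p_M) * p_A - D,
                         (1 - p_M) * (1 - p_A) + D])
    marker_allele = np.array([1, 1, 0, 0])   # 1 if the haplotype carries M
    trait_allele = np.array([1, 0, 1, 0])    # 1 if the haplotype carries A
    haps = rng.choice(4, size=(n, 2), p=hap_freq)        # two haplotypes per person
    x_marker = marker_allele[haps].sum(axis=1) - 1       # 1, 0, -1 for MM, Mm, mm
    x_trait = trait_allele[haps].sum(axis=1) - 1         # 1, 0, -1 for AA, Aa, aa
    y = alpha * x_trait + rng.normal(0.0, sigma, size=n)
    X = np.column_stack([np.ones(n), x_marker])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]                                        # slope on the marker indicator
```

With a large sample, for example `simulate_marker_estimate(200_000, 0.3, 0.4, 0.05, 2.0)`, the estimate should lie close to `D * alpha / (p_M * (1 - p_M))`, which is about 0.476 for these values.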

References

1. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;1:302-332.