Supplementary Material for Soranzo et al (2004)

Michael E. Weale, Biology Department, UCL

Section I: Bayesian Assessment of the Boundaries of an Associated Interval

Background to an Associated Interval:

Here I formalise the idea of an associated interval as defined in Goldstein (2003). The idea applies to a situation in which a polymorphism has been associated with a phenotype of interest in one dataset (“phenotyped” dataset). This polymorphism is referred to as the “associated variant”, M. In a separate dataset (“LD” dataset) there is information about the pattern of association between a set of linked SNPs and the “associated variant”. These two sources of information are combined to assess which markers, and the sequence stretch that they delimit, should be considered as candidates for causing the original genotype-phenotype association. This stretch of sequence is the “associated interval” and should be exhaustively searched for causal variants. I note that given the imminent availability of a relatively dense set of LD genotype data stemming from the HapMap project, this situation will soon be very common. A variant will be associated in a “phenotyped” dataset, and typing it in the same individuals used in the HapMap project will provide data on the association between the associated variant and linked polymorphisms. The approaches described here will allow estimation of the associated interval in this context.

The figure describes the two Bayesian models which are applied separately to each SNP i in the LD dataset. Model 0 proposes that SNP M, the “associated variant”, is causal. Model 1 proposes that SNP i is causal. Under Model 0, the phenotypic data P in a set of nY phenotyped individuals is explained directly by the known genotype information, G’M, regarding SNP M, via an association model described by parameters in b (call this set b0 for Model 0). Under Model 1, P is explained indirectly by linkage of M to SNP i (determined by f, the vector of four two-locus haplotype frequencies) and then by the association model between SNP i and phenotype described by parameters in b1. Information on f is obtained from the LD dataset, G, which contains genotype information on M and SNP i in a sample of nG individuals. G is allowed to contain phase-resolved information on heterozygotes, if available from, for example, typing family members.

Many different forms of the association model, described by the parameters in b, are possible (see later). One such model, used to generate Figure 1(c) in the paper, is a logistic regression model appropriate for case-control data. It is assumed that the genotypic odds ratio is greatest between the two homozygous genotypes – denoted by parameter θ. The odds ratio between the heterozygous genotype and the reference homozygote is given by θc, where c is a dominance modifier taking a value between 0 and 1 (½ indicates no dominance). While c was given an uninformative uniform prior in the paper by Soranzo et al., an informative prior was applied to θ, based on the emerging picture of the strength of clinical association found for variants implicated in complex diseases. To date the strength of association, even for the most implicated variants such as APOE4 and BRCA1, as not exceeded θ=5. A Normal prior was therefore applied to loge(θ), centred on zero, with 2.5% and 97.5% quantiles of ±loge(5).

WinBUGs (Spiegelhalter et al. 2003) was used to find joint posterior distributions of f and b via Markov chain Monte Carlo approximation. The associated interval was assessed by estimating the posterior odds ratio: P(Model 1|Data) / P(Model 0|Data). Equal prior weight was given to each model, and so estimate the posterior odds as the Bayes Factor: P(Data|Model 1) / P(Data|Model 0).

Implementation of a Bayesian Assessment of the Associated Interval:

In the following sections, I first outline (in Sub-section 1) the method for estimating the joint posterior distribution of f, as this can be used separately to the issue of assessing an Association Interval. For example, it can be used (see Figure 1(b) in the paper) to obtain posterior credible intervals for pairwise LD measures such as r2.

In Sub-section 2, I describe in more detail the method for assessing the Associated Interval. In this instance, I explain the implementation of an association model appropriate for an intermediate phenotype such as protein expression level, in which a simple Normal error model is assumed.

In Sub-section 3, I describe how the same method can be applied to clinical case-control data, using the genotypic data to explain the phenotypic data via the usual logistic regression model.

Finally, in Sub-section 4, I describe how case-control data can also, if desired, be treated by using the phenotypic data to explain the genotypic data. This reversal of the conditioning is a more proper description of case-control data, since case and control numbers are usually fixed whereas genotypes are determined subsequently. It is well known that maximum likelihood estimators are the same regardless of the direction of conditioning. In Bayesian estimation there are slight differences, but in our experience we have found these to be trivial. For computational reasons, we therefore prefer the method of Sub-section 3, but we present the alternative method here for completeness.

Each section has a corresponding *.odc file that can be used to run the analysis in WinBUGS ver1.4. WinBUGS ver1.4 is currently available for free download from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml. Each *.odc file can be viewed from within WinBUGS and contains further information on how to run the analysis. The *.odc files and other information can be found at http://popgen.biol.ucl.ac.uk/software.html.


1) Bayesian estimation of 2-SNP-locus haplotype phase frequencies

There already exist a number of Bayesian methods for haplotype phase estimation and so the theory is already fairly well established. Nevertheless, it will be useful to describe the approach used here, as there are some features which can separate the different methods used. This method allows partially missing data and family data (for resolution of phase of some heterozygotes).

Outline of method

Let G be a matrix of two columns, GM and GS, containing the genotype states at a diallelic marker locus M and other SNP locus S respectively, for each individual in a sample of nG unrelated individuals. The probability P(GMi,GSi|f) of observing a given genotype GMi,GSi in the ith individual, given the vector f of four haplotype frequency parameters, is given by the table in the technical details section below.

The overall likelihood of the data is:

if conditioned on the ordering of individuals, or

if unconditioned on the ordering of individuals.

In either case, the likelihood can be used to find the posterior distribution of f when combined with some prior. We chose to use employ a flat non-informative Dirichlet(1,1,1,1) prior. The posterior distribution of f is found by use of a Gibbs Sampler in WinBUGS, from which the posterior distribution of derived LD statistics such as r2 can also be found. Further details of the WinBUGS program can be found in the file “Phayes.odc”.

Technical details

It is useful to start by coding all genotypes at all SNP loci found in a sample of unrelated individuals as follows:

“0” - A/a genotype (phase unknown) (aka 0/1 genotype)

“1” - A/A genotype (aka 1/1 genotype)

“2” - a/a genotype (aka 0/0 genotype)

“3” - ?/? genotype (completely missing data)

“4” - A/? genotype (partially missing data)

“5” - a/? genotype (partially missing data)

“6” - AT/a genotype (A/a genotype with phase known, A allele “transmitted”)

“7” - A/aT genotype (A/a genotype with phase known, a allele “transmitted”)

The above coding is the same as that used by the PLEM phase program (http://www.people.fas.harvard.edu/~junliu/plem/), with the addition of two extra codes “6” and “7” to allow for resolved heterozygotes determined from genotyping other family members (e.g. trios). Here the AT/a genotype means that allele A is “transmitted”, meaning it rests on the chromosome transmitted to the child in a trio study. However, for our purposes “transmitted” is only a label to distinguish one chromosome from the other, so that a AT/a genotype in two different loci on the same individual means that the two A alleles rest on the same chromosome (and likewise the two a alleles rest on the other).

The function “MtoPLEM.m” in TagIT allows the conversion of trio data to the above format, including resolution of heterozygote phase and filling in of missing data inferred from other trio members.

I define the elements of f, the four 2-locus haplotype population frequencies, as:

f1 = Freq. of a_a haplotype

f2 = Freq. of a_A haplotype

f3 = Freq. of A_a haplotype

f4 = Freq. of A_A haplotype

Some algebra results in the following table of expected frequencies of all 64 possible diplotypes, made up of the genotypes present at each of two loci, using the coding given above.

Code
Loc2 / “0” / “1” / “2” / “3” / “4” / “5” / “6” / “7”
Code
Loc1 / A/a / A/A / a/a / ?/? / A/? / a/? / AT/a / A/aT
“0” / A/a / 00 11 01 10
11 00 10 01
2(f1f4+f2f3) / 01 11
11 01
2f2f4 / 00 10
10 00
2f1f3 / 2(f1+f2)
(f3+f4) / 2f2f4+
f1f4+f2f3 / 2f1f3+
f1f4+f2f3 / 00 10
11 01
f1f4+f2f3 / 01 11
10 00
f1f4+f2f3
“1” / A/A / 10 11
11 10
2f3f4 / 11
11
f42 / 10
10
f32 / (f3+f4)2 / f4(f3+f4) / f3(f3+f4) / 10
11
f3f4 / 11
10
f3f4
“2” / a/a / 00 01
01 00
2f1f2 / 01
01
f22 / 00
00
f12 / (f1+f2)2 / f2(f1+f2) / f1(f1+f2) / 00
01
f1f2 / 01
00
f1f2
“3” / ?/? / 2(f1+f3)(f2+f4) / (f2+f4)2 / (f1+f3)2 / 1 / ( f2+f4) / ( f1+f3) / (f1+f3)(f2+f4) / (f1+f3)(f2+f4)
“4” / A/? / 2f3f4+f1f4+f2f3 / f4(f2+f4) / f3(f1+f3) / ( f3+f4) / f4 +
(f2f3-f1f4)/2 / f3 +
(f1f4-f2f3)/2 / f3f4+
(f1f4+f2f3)/2 / f3f4+
(f1f4+f2f3)/2
“5” / a/? / 2f1f2+f1f4+f2f3 / f2(f2+f4) / f1(f1+f3) / ( f1+f2) / f2 +
(f1f4-f2f3)/2 / f1 +
(f2f3-f1f4)/2 / f1f2+
(f1f4+f2f3)/2 / f1f2+
(f1f4+f2f3)/2
“6” / AT/a / 00 01
11 10
f1f4+f2f3 / 01
11
f2f4 / 00
10
f1f3 / (f1+f2)
(f3+f4) / f2f4+
(f1f4+f2f3)/2 / f1f3+
(f1f4+f2f3)/2 / 00
11
f1f4 / 01
10
f2f3
“7” / A/aT / 10 11
01 00
f1f4+f2f3 / 11
01
f2f4 / 10
00
f1f3 / (f1+f2)
(f3+f4) / f2f4+
(f1f4+f2f3)/2 / f1f3+
(f1f4+f2f3)/2 / 10
01
f2f3 / 11
00
f1f4

For some cells, I have elucidated all the ways of making phased 2-locus diplotypes that match the observed genotypes. Diplotypes are made up of 4 numbers, 2 above and 2 below, where “1” represents the A allele and “0” represents the a allele. In each diplotype, the ones on the left are for “Loc1” and on the right are for “Loc2”. Diplotypes are ordered so that the bottom row represents (arbitrarily) the chromosome transmitted to a 1st child and the top row is the chromosome not transmitted (if this is not trio data but phase is known by some other means, one can arbitrarily call one chromosome “A” and the other “B”). Some cells have too many possible diplotypes, so these are not elucidated.

For all cells, I give the probability of seeing that combination based on the haplotype frequencies: f1, f2, f3 and f4, assuming random mating. For combinations without missing data, this is the expected frequency of that combination in the population. For combinations involving missing data, this is the expected frequency of that combination among all combinations showing the same number (in respective loci) of “?” values as observed here, assuming that there is no effect of allele, genotype or haplotype on the occurrence of missing data. Thus in the later likelihood calculations, the data are split into separate categories based on the number of missing “?” entries they have, and the likelihood is the product of probabilities of observing the data conditional on the given fixed total number of observations within each category.

For further explanation of the probabilities, start with the blue cells. This marks cases where the haplotypes on both chromomes are known with certainty, so the probability is simply the product of the two haplotype frequencies. Next consider the green cells. These are for cases where the phase of A/a heterozygotes is uncertain, so any entry in, say, the “0” column is found by adding the 2 (equal) probabilities in the “6”+”7” column, and likewise for rows. Next, the entries in the “3” column are found by summing over all entries in the “0”+”1”+”2” columns. These entries are also intuitively obvious as the HWE frequencies found from the allele frequencies of the other locus (the one without missing data). Finally, the entries in the “4” column equal the “1” column plus ½ the “0” column, while the entries in the “5” column equal the “2” column plus ½ the “0” column (because if you stick a “?” at random on a heterozygote, it has a 50:50 chance of coming out either a A/? or a a/?). The entries for the intersection of columns “4” and “5” and rows “4” and “5” are found in the same way, after some algebraic simplification.