STATC141 Spring 2006

Lecture 4, 01/31/2006

Chance Models in Mendel’s Genetics

Mendel’s theory shows the power of simple chance models in action. In 1865, Gregor Mendel published an article which provided a scientific explanation for heredity, and eventually caused a revolution in biology. By a curious twist of fortune, this paper was ignored for about thirty years, until the theory was simultaneously rediscovered by three men: Correns in Germany, de Vries in Holland, and Tschermak in Austria. De Vries and Tschermak are now thought to have seen Mendel’s paper before they published, but Correns apparently found the idea by himself.

Mendel’s experiments were all carried out on garden peas; here is a brief account of one of these experiments. Pea seeds are either yellow or green. (As the phrase suggests, seed color is a property of the seed itself, and not of the parental plant: indeed, one parent often has seeds of both colors.) Mendel bred a pure yellow strain, that is, a strain in which every plant in every generation had only yellow seeds; and separately he bred a pure green strain. He then crossed plants of the pure yellow strain with plants of the pure green strain: for instance, he used pollen from the yellows to fertilize ovules on plants of the green strain. (The alternative method, using pollen from the greens to fertilize plants of the yellow strain, gave exactly the same results.) The seeds resulting from a yellow-green cross, and the plants into which they grow, are called “first-generation hybrids.” First-generation hybrid seeds are all yellow, indistinguishable from seeds of the pure yellow strain. The green seems to have disappeared completely.

These first-generation hybrid seeds grew into first-generation hybrid plants which Mendel crossed with themselves, producing “second-generation hybrid” seeds. Some of these second-generation seeds were yellow, but some were green. So the green disappeared for one generation, but reappeared in the second. Even more surprising, the green reappeared in a definite simple proportion: of the second-generation hybrid seeds, about 75% were yellow and 25% were green.

What is behind this regularity? To explain it, Mendel postulated the existence of the entities now called “genes.” According to Mendel’s theory, there were two different variants of a gene which paired up to control seed color. They will be denoted here by y (for yellow) and g (for green). It is the gene-pair in the seed, not the parent, which determines what color the seed will be, and all the cells making up a seed contain the same gene-pair. There are four different gene-pairs: y/y, y/g, g/y, and g/g. Gene-pairs control seed color by the rule:

y/y, y/g and g/y make yellow

g/g makes green.

As geneticists say, y is “dominant” and g is “recessive.” This completes the first part of the model.

Now the seed grows up and becomes a plant; all cells in this plant also carry the seed’s color gene-pair, with one exception. Sex cells, either sperm or eggs, contain only one gene of the pair. For instance, a plant whose ordinary cells contain the gene-pair y/y will produce sperm cells each containing the gene y. Similarly, it will produce egg cells each containing the gene y. On the other hand, a plant whose ordinary cells contain the gene-pair y/g will produce some sperm cells containing the gene y, and some sperm cells containing the gene g. In fact, half its sperm cells will contain y, and the other half will contain g; similarly, half its eggs will contain y, and the other half will contain g.

This model accounts for the experimental results. Plants of the pure yellow strain have the color gene-pair y/y, so their sperm and eggs all contain just the gene y. Similarly, plants of the pure green strain have the gene-pair g/g, so their pollen and ovules contain just the gene g. Crossing a pure yellow with a pure green amounts, for instance, to fertilizing a g-egg by a y-sperm, producing a fertilized cell having the gene-pair y/g. This cell reproduces itself and eventually becomes a seed, in which all the cells have the gene-pair y/g and are yellow in color. The model has explained why all first-generation hybrid seeds are yellow, and none are green.

What about the second generation? A first-generation hybrid seed grows into a first-generation hybrid plant, with the gene-pair y/g. This plant produces sperm cells, of which half will contain the gene y and the other half will contain the gene g; it also produces eggs, of which half will contain y and the other half will contain g. When two first-generation hybrids are crossed, each resulting second-generation hybrid seed gets one gene at random from each parent, because it is formed by the random combination of a sperm cell and an egg.

Figure 1. Mendel’s chance model for the genetic determination of seed color: one gene is chosen at random from each parent. The chance of each combination is shown. (The sperm gene is listed first; in terms of seed color, the combinations y/g and g/y are not distinguishable after fertilization.)

As shown in Figure 1, the seed has a 25% chance to get a gene-pair with two g’s and be green; it has a 75% chance to get a gene-pair with one or two y’s and be yellow. The number of seeds is small by comparison with the number of pollen grains, so the selections for the various seeds are essentially independent. The conclusion: the color of second-generation hybrid seeds will be determined as if by a sequence of draws with replacement from a box of four tickets, three marked “yellow” and one marked “green.”

And that is how the model accounts for the reappearance of green in the second generation, for about 25% of the seeds.
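The chance model above can be checked with a short Python sketch. It enumerates the four equally likely (sperm gene, egg gene) combinations and then simulates a batch of second-generation seeds. The function names and the seed count of 8,023 are illustrative choices, not part of Mendel’s account.

```python
import random
from itertools import product

def seed_color(pair):
    """Dominance rule from the text: any y in the gene-pair makes the seed yellow."""
    return "yellow" if "y" in pair else "green"

def exact_color_chances(sperm_parent=("y", "g"), egg_parent=("y", "g")):
    """Enumerate the equally likely (sperm gene, egg gene) combinations."""
    pairs = list(product(sperm_parent, egg_parent))
    chances = {"yellow": 0.0, "green": 0.0}
    for pair in pairs:
        chances[seed_color(pair)] += 1 / len(pairs)
    return chances

def simulate_cross(n_seeds, rng):
    """Draw one gene at random from each hybrid (y/g) parent for every seed."""
    counts = {"yellow": 0, "green": 0}
    for _ in range(n_seeds):
        pair = (rng.choice("yg"), rng.choice("yg"))
        counts[seed_color(pair)] += 1
    return counts

# Hybrid x hybrid: exact chances, then one simulated batch of seeds.
print(exact_color_chances())
print(simulate_cross(8023, random.Random(0)))
# Pure yellow x pure green: every first-generation seed is yellow.
print(exact_color_chances(("y", "y"), ("g", "g")))
```

The simulated green fraction will wander around 25% from run to run; that chance variability is the quantity examined in the next section.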

Mendel made a bold leap from his experimental evidence to his theoretical conclusions. His reconstruction of the chain of heredity was based entirely on statistical evidence of the kind discussed here. And he was right. Modern research in genetics and molecular biology is uncovering the chemical basis of heredity, and has provided ample direct proof for the existence of Mendel’s hypothetical entities. As we know today, genes are segments of DNA on chromosomes.

Essentially the same mechanism of heredity operates in all forms of life, from dolphins to fruit flies. So the genetic model proposed by Mendel unlocks one of the great mysteries of life: how is it that a pea-seed always produces a pea, and never a tomato or a whale? Furthermore, the answer turns out to involve chance in a crucial way, despite Einstein’s remark, “I shall never believe that God plays dice with the world.”

An appreciation of Mendel’s genetic model

Chance models are now used in many fields. Usually, the models only assert that certain entities behave as if they were determined by drawing tickets at random from a box, and little effort is spent establishing a physical basis for the claim of randomness. Indeed, the models seldom say explicitly what is like the box, or what is like the tickets.

The genetic model is quite unusual, in that it answers such questions. There are two main sources of randomness in the model:

  1. the random allotment of chromosomes (one from each pair) to sex cells;
  2. the random pairing of sex cells to produce a fertilized egg.

Did Mendel’s facts fit his model?

Mendel’s discovery ranks as one of the greatest in science. Today, his theory is amply proved and extremely powerful. But how good was his own experimental proof? Did Mendel’s data prove his theory? Only too well, answered R.A. Fisher. Mendel’s observed frequencies were uncomfortably close to his expected frequencies, much closer than ordinary chance variability would permit.

In one experiment, for instance, Mendel obtained 8,023 second-generation hybrid seeds. He expected 1/4 × 8,023 ≈ 2,006 of them to be green, and observed 2,001, for a discrepancy of 5. According to his own chance model, about 88% of the time, chance variation would cause a discrepancy between Mendel’s expectations and his observations greater than the one he reported. By itself, this evidence is not very strong. The trouble is, every one of Mendel’s experiments shows this kind of unusually close agreement between expectations and observations. Using the $\chi^2$-test to pool the results, Fisher showed that the chance of agreement as close as that reported by Mendel is about four in a hundred thousand.
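The 88% figure can be roughly reproduced with a normal approximation to the number of green seeds. The sketch below is one way to do it, assuming a continuity correction of one half; the function name is hypothetical and the notes do not say which method was used for the quoted figure.

```python
import math

def chance_of_larger_discrepancy(n, p, discrepancy):
    """Approximate chance that |count - expected| exceeds `discrepancy`.

    Uses the normal approximation to the binomial count, with a
    continuity correction of 0.5 (an assumption of this sketch).
    """
    sd = math.sqrt(n * p * (1 - p))      # SD of the number of green seeds
    z = (discrepancy + 0.5) / sd         # standard units, continuity-corrected
    return math.erfc(z / math.sqrt(2))   # two-sided normal tail area

# 8,023 seeds, 1/4 chance of green, reported discrepancy of 5 seeds.
print(chance_of_larger_discrepancy(8023, 0.25, 5))
```

The SD works out to about 39 seeds, so a discrepancy of 5 is only a small fraction of one SD, and the computed chance comes out close to the figure quoted above.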

The Chi-square test

This test helps to evaluate the deviation of observed values from expected values.

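$$\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}.$$

Degrees of freedom = number of terms in $\chi^2$ minus one.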

With independent experiments, the results can be pooled by adding up the separate chi-square-statistics; the degrees of freedom add up too.

Example 1. One of Mendel’s breeding trials came out as follows.

For these data, $\chi^2 = 0.5$ and the degrees of freedom = 4 − 1 = 3. The p-value, here the chance of a $\chi^2$ this small or smaller, is about 8%, which is inconclusive but points to fudging. If we observed five independent experiments of this kind, all with similar chi-square values, then the chi-square statistic for the pooled data would be around 2.5 with 15 degrees of freedom. The p-value would then be about 0.00013.
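The numbers in Example 1 can be checked with a small Python sketch using only the standard library. The left-tail probability is computed from the power series for the incomplete gamma function. The observed counts are an assumption: they are the commonly quoted figures for Mendel’s two-trait cross (smooth yellow, wrinkled yellow, smooth green, wrinkled green) with expected frequencies in the ratio 9:3:3:1, and they may differ in rounding from the table used in the example.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_left_tail(x, df):
    """P(chi-square with df degrees of freedom <= x).

    Uses the power series for the regularized lower incomplete gamma
    function, which converges quickly for the small x used here.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

# Assumed data: commonly quoted counts for Mendel's two-trait cross.
observed = [315, 101, 108, 32]
n_seeds = sum(observed)
expected = [n_seeds * r / 16 for r in (9, 3, 3, 1)]
statistic = chi_square_statistic(observed, expected)

print(round(statistic, 2), chi_square_left_tail(statistic, 3))
print(chi_square_left_tail(0.5, 3))          # single trial, value quoted in the text
print(chi_square_left_tail(5 * 0.5, 5 * 3))  # five pooled trials: statistics and df add
```

The last line illustrates the pooling rule: individually unremarkable results become very unlikely once five of them are added together.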

Mendelian Concepts

Every diploid organism has two copies of each genetic locus carried on pairs of autosomes (chromosomes other than sex chromosomes). A locus is an identifiable region on a chromosome, and it may correspond to a gene or to a physical marker such as a sequence-tagged site (STS). The two gene copies corresponding to a particular locus in an organism may or may not be exactly identical. During meiosis, alleles corresponding to a particular locus segregate, which means that one copy of any locus appears in any given gamete. In contrast, two different genes on the same chromosome do not segregate unless recombination has occurred.

If two alleles at a given locus are identical in an individual, then that individual is said to be homozygous for the genes at that locus. If the two alleles are different, then the individual is heterozygous with respect to the genes at that locus. Sometimes the phenotype associated with a gene fails to appear because of the particular constellation of other genes in that individual or because of particular environmental circumstances. The probability that a gene confers the phenotype associated with it is called its penetrance.

Linkage Disequilibrium (LD)

The materials on this topic will be distributed in class.

The materials are from pages 381-392 of Computational Genome Analysis: An Introduction. Deonier RC, Tavaré S & Waterman MS. Published by Springer-Verlag, New York. 540pp. ISBN: 0-387-98785-1.

Probabilities and Statistics in Shotgun Sequencing

Shotgun Sequencing.

Before any analysis of a DNA sequence can take place it is first necessary to determine the actual sequence itself, at least as accurately as is reasonably possible. Unfortunately, technical considerations make it impossible to sequence very long pieces of DNA all at once. Current sequencing technologies allow accurate reading of no more than 500 to 800bp of contiguous DNA sequence. This means that the sequence of an entire genome must be assembled from collections of comparatively short subsequences. This process is called DNA sequence “assembly”.

One approach to sequence assembly is to produce the sequence of a DNA segment (called a “contig,” or perhaps a genome) from a large number of randomly chosen sequence reads (many overlapping small pieces, each on the order of 500-800 bases). One difficulty of this process is that the locations of the fragments within the genome and with respect to each other are not generally known. However, if enough fragments are sequenced so that there will be many overlaps between them, the fragments can be matched up and assembled. This method is called “shotgun sequencing.”

Shotgun sequencing approaches, including the whole-genome shotgun approach, are currently a central part of all genome-sequencing efforts. These methods require a high level of automation in sample preparation and analysis and are heavily reliant on the power of modern computers. There is an interplay between substrates to be sequenced (genomes and their representation in clone libraries), the analytical tools for generating a DNA sequence, the sequencing strategies, and the computational methods. The key underlying determinant is that we can obtain high-quality continuous sequence reads of up to 500 to 800 bases with current technology. This represents a tiny fraction of either a prokaryotic or eukaryotic genome. The problem in large measure is defined by the need to assemble a larger whole from a large number of small parts. Therefore, a large number of randomly generated sequence reads should be used to generate sequences at appropriately large levels of “sequence coverage.”

Coverage Theorem.

Sequence coverage is the average number of times any given genomic base is represented in sequence reads.

Definition--It is customary to say that a-times coverage (or aX coverage) is obtained if, when the length of the original long sequence is G, the total length of the fragments sequenced is aG.

Theorem 1--Assuming that there are N fragments of length L each and the length of the original long sequence is G, the coverage is a = NL/G. Then in order for the mean proportion of the original long sequence covered by at least one fragment to be 0.99, it is necessary to have at least 4.6X coverage.

Before proving the above theorem, let us look at the distribution of the location of the left-hand end of the fragments. We assume that the fragments are taken at random from the original full-length sequence, so that, ignoring end effects, the position of the left-hand end of any fragment is uniformly distributed in (0, G) and thus falls in an interval (x, x+L) on the original sequence with probability L/G. The number of fragments whose left-hand end falls in this interval has a binomial distribution with mean NL/G. If N is large and L is small relative to G, the discussion on “Poisson is Ultimate Binomial” above (1) shows that this distribution is approximately Poisson with mean NL/G.
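To see how close the Poisson approximation is in this setting, the two probability functions can be compared directly. The sketch below uses illustrative values N = 2000, L = 500 and G = 200,000 (so NL/G = 5); these numbers are assumptions chosen for the example, not values from the lecture.

```python
import math

def binomial_pmf(k, n, p):
    """Chance that exactly k of n fragments start in the interval."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """Poisson chance of exactly k events with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

N, L, G = 2000, 500, 200_000   # illustrative values only
p = L / G                      # chance one fragment starts in (x, x + L)
mean = N * p                   # NL/G, the coverage

for k in range(6):
    print(k, binomial_pmf(k, N, p), poisson_pmf(k, mean))
```

With these values the two columns agree to roughly three decimal places, which is the sense in which the binomial count is “approximately Poisson.”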

Proof of Theorem 1: The mean proportion of the genome covered by one or more fragments is the probability that a point chosen at random is covered by at least one fragment. This is the probability that at least one fragment has its left-hand end in the interval of length L immediately to the left of this point, which is approximately $1 - e^{-NL/G}$ (one minus the Poisson probability that no fragment starts in the interval). In order for $1 - e^{-NL/G}$ to be 0.99 it is necessary to have $NL/G = \ln 100 \approx 4.6$, so that the sum of the fragment lengths is then not quite five times the genome length.

Note that since the human genome is approximately $3 \times 10^9$ nucleotides, 4.6X coverage will still miss approximately 30,000,000 of them.
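The arithmetic behind Theorem 1 and the note above can be sketched in a few lines of Python. The function names are hypothetical, and the genome length of 3 × 10^9 is the round figure implied by the 30,000,000 missed nucleotides.

```python
import math

def fraction_covered(coverage):
    """Mean proportion of the genome covered at a-times coverage: 1 - e^(-a)."""
    return 1.0 - math.exp(-coverage)

def coverage_needed(target_fraction):
    """Coverage a = NL/G that solves 1 - e^(-a) = target_fraction."""
    return -math.log(1.0 - target_fraction)

def expected_bases_missed(genome_length, coverage):
    """Mean number of bases not covered by any fragment."""
    return genome_length * math.exp(-coverage)

print(coverage_needed(0.99))              # coverage for 99% of the sequence
print(fraction_covered(4.6))              # proportion covered at 4.6X
print(expected_bases_missed(3e9, 4.6))    # bases still missed in a 3e9-base genome
```

Because the uncovered proportion is $e^{-a}$, each additional unit of coverage cuts the expected number of missed bases by a factor of about 2.7.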

Mean number of contigs

Each contig has a unique rightmost fragment, so that the mean number of contigs is the number of fragments N multiplied by the probability that a fragment is the rightmost member of a contig. This latter probability is the probability that no other fragment has its left-hand end point on the fragment in question. From the Poisson probability function, this probability is $e^{-NL/G} = e^{-a}$. Thus the mean number of contigs = $N e^{-NL/G} = N e^{-a}$.
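A small simulation gives a feel for how well this formula works. The sketch below places N fragment start points uniformly at random and counts a new contig whenever the next fragment starts beyond the end of the previous one. It ignores end effects in the same way as the text, and the values N = 2000, L = 500, G = 200,000 are illustrative assumptions.

```python
import math
import random

def mean_contigs(n_fragments, fragment_length, genome_length):
    """Formula from the text: N * exp(-NL/G)."""
    a = n_fragments * fragment_length / genome_length
    return n_fragments * math.exp(-a)

def simulate_contigs(n_fragments, fragment_length, genome_length, rng):
    """Count contigs for one random placement of equal-length fragments."""
    starts = sorted(rng.uniform(0, genome_length) for _ in range(n_fragments))
    contigs = 1
    for left, right in zip(starts, starts[1:]):
        # A gap opens when the next fragment starts past the end of this one.
        if right - left > fragment_length:
            contigs += 1
    return contigs

N, L, G = 2000, 500, 200_000   # illustrative values, 5X coverage
rng = random.Random(1)
runs = [simulate_contigs(N, L, G, rng) for _ in range(200)]
print(mean_contigs(N, L, G), sum(runs) / len(runs))
```

The simulated average should land close to the formula, typically slightly above it, since the rightmost fragment always ends a contig and the formula neglects that end effect.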