Here S a Summary of Files You Need to Submit

Problem Set 4

(Due Nov. 11th, Tuesday, 8pm EST)

Please make sure to show your work and calculations and state any assumptions you make in answering the following questions. Include the names of the people you worked with at the top of your problem set.

Here’s a summary of files you need to submit:

If your name is John Harvard and you’re in Lan Zhang’s section:

JohnHarvard_ps4_LZ.doc

JohnHarvard_ps4_p1_LZ.pl

JohnHarvard_ps4_p1complete_LZ.out

JohnHarvard_ps4_p1single_LZ.out

JohnHarvard_ps4_LZ.cdt

JohnHarvard_ps4_LZ.atr

1. Clustering (37 pts total)

Microarray and DNA chip technologies have made it possible to study expression patterns of thousand of genes simultaneously. The amount of data coming out of these efforts is overwhelming. A powerful strategy for analysis of these large-scale data is the clustering of expression profiles. Expression profiles can be clustered by gene or by condition. Sørlie et al. classified breast carcinomas based on gene expression patterns derived from cDNA microarrays (PNAS, 98, 10869-10874). In this problem, you will be analyzing the same data using different clustering algorithms.

1.1Describe the two major goals of this paper in one sentence. (2 pts)

To classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. (Quoted from the abstract).

1.2 In this paper, the breast carcinomas were clustered using a hierarchical clustering algorithm. Your first assignment is to write a Perl program to implement variations of this algorithm. A brief summary of the hierarchical clustering algorithm you are asked to implement can be found here. The data you are going to analyze contain the expression profile of an intrinsic set of 456 genes in 85 breast samples (download data). Use the correlation coefficient as the distance metric and do both single-linkage and complete-linkage clustering. Use this template (PS4_p1_template_2003.pl) for your program. Name your program as FirstnameLastname_ps4_p1_TFinitials.pl. Note that your program needs to take the same command line arguments as described in the template, otherwise points will be deducted from your score. (20 pts total)

See PS4_answer_p1_2003.pl

1.2.1 Run your program on the dataset to generate 2 clusters of breast samples using the complete-linkage (farthest neighbor) correlation coefficient metric. Provide your clustering resultsin a separate file named FirstnameLastname_ps4_p1complete_TFinitials.out(List the members of each cluster according to this format).

See PS4_answer_p1complete_2003.out

1.2.2 Run your program on the dataset to generate 5 clusters of breast carcinoma samples using the single-linkage (nearest neighbor) correlation coefficient metric. Provide your clustering results in a separate file named FirstnameLastname_ps4_p1single_TFinitials.out(List the members of each cluster according to this format).

See PS4_answer_p1single_2003.out

Partial credits are given for the following tasks:

Reading input data (3 pts)
Constructing distance matrix (3 pts)
Common mistakes:

Incorrect handling of missing data points (most common mistake: not eliminating a gene from the calculation for an individual sample pair which has missing data for one or both points for that gene). This usually, results in slightly inaccurate values (on the order of ~0.01-0.001) and is usually enough to throw off the clustering results. 1 point penalty.
Incorrect correlation formulas. 1-2 point penalty, depending on the severity of the error.

Updating distance matrix (8 pts, 4 pts for complete linkage, 4 pts for single linkage)

Updating the distance matrix is actually a bit more complicated than most people realized. Below is an example of a distance matrix. Here, the smallest distance, 0.0013, corresponds to a sample pair 6, 2.

0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
0 / 0 / 0.9057 / 0.4063 / 0.6075 / 0.3189 / 0.5741 / 0.2409 / 0.227 / 0.2918
1 / 0.9057 / 0 / 0.7803 / 0.1398 / 0.7844 / 0.9478 / 0.1433 / 0.3451 / 0.142
2 / 0.4063 / 0.7803 / 0 / 0.7384 / 0.3789 / 0.1487 / 0.0013 / 0.8273 / 0.5882
3 / 0.6075 / 0.1398 / 0.7384 / 0 / 0.1088 / 0.1974 / 0.0898 / 0.3803 / 0.5151
4 / 0.3189 / 0.7844 / 0.3789 / 0.1088 / 0 / 0.9341 / 0.5994 / 0.2175 / 0.882
5 / 0.5741 / 0.9478 / 0.1487 / 0.1974 / 0.9341 / 0 / 0.9018 / 0.3313 / 0.0746
6 / 0.2409 / 0.1433 / 0.0013 / 0.0898 / 0.5994 / 0.9018 / 0 / 0.8059 / 0.0222
7 / 0.227 / 0.3451 / 0.8273 / 0.3803 / 0.2175 / 0.3313 / 0.8059 / 0 / 0.0488
8 / 0.2918 / 0.142 / 0.5882 / 0.5151 / 0.882 / 0.0746 / 0.0222 / 0.0488 / 0

The straightforward method is to simply take a pair of rows and compress them together, and then to take a pair of columns, and then compress them together, while ignoring cases (gray boxes) where the ith or jth index corresponds to one of the sample pairs (in this case, 2 or 6). Many people updated their distance matrices as if the table is fully fleshed out (both black and gray values present), and if it was, then the distance matrix would be updated correctly.

However, frequently, people constructed their distance matrix such that the values in gray text are not present in their @distance variable. This would result in “Use of uninitialized variable” errors. And if people tried to compensate for this, they often didn’t take into account the added complexities of updating the distance matrix if the gray values were missing.

Picking the correct distance pairs in such case means flipping the coordinates of the distance pair at the correct places. The following is worked out for the example matrix shown above (complete linkage):

if ($cidx1, $cidx2) = (6, 2) / $i / Dist. 1 / Dist. 2 / Conditions
max of: ($cidx1, $i) and ($cidx2, $i) / 0 / (6, 0) / (2, 0) / ($i < smaller $cid)
max of: ($cidx1, $i) and ($cidx2, $i) / 1 / (6, 1) / (2, 1) / ($i < smaller $cid)
SKIP if $i = $cidx2 + 1 / 2 / (6, 2) / (2, 2) / ($i = smaller $cid)
max of: ($cidx1, $i) and ($i, $cidx2) / 3 / (6, 3) / (3, 2) / (smaller $cid < $i < larger $cid)
max of: ($cidx1, $i) and ($i, $cidx2) / 4 / (6, 4) / (4, 2) / (smaller $cid < $i < larger $cid)
max of: ($cidx1, $i) and ($i, $cidx2) / 5 / (6, 5) / (5, 2) / (smaller $cid < $i < larger $cid)
SKIP if $i = $cidx1 + 1 / 6 / (6, 6) / (6, 2) / ($i = larger $cid)
max of: ($i, $cidx1) and ($i, $cidx2) / 7 / (7, 6) / (7, 2) / ($i > larger $cid)
max of: ($i, $cidx1) and ($i, $cidx2) / 8 / (8, 6) / (8, 2) / ($i > larger $cid)

The results of this is shown below.

0 / 1 / 2, 6 / 3 / 4 / 5 / 7 / 8
0 / 0 / 0.9057 / 0.4063 / 0.6075 / 0.3189 / 0.5741 / 0.227 / 0.2918
1 / 0.9057 / 0 / 0.7803 / 0.1398 / 0.7844 / 0.9478 / 0.3451 / 0.142
2, 6 / 0.4063 / 0.7803 / 0 / 0.7384 / 0.3789 / 0.1487 / 0.8273 / 0.5882
3 / 0.6075 / 0.1398 / 0.7384 / 0 / 0.1088 / 0.1974 / 0.3803 / 0.5151
4 / 0.3189 / 0.7844 / 0.5994 / 0.1088 / 0 / 0.9341 / 0.2175 / 0.882
5 / 0.5741 / 0.9478 / 0.1487 / 0.1974 / 0.9341 / 0 / 0.3313 / 0.0746
7 / 0.227 / 0.3451 / 0.8273 / 0.3803 / 0.2175 / 0.3313 / 0 / 0.0488
8 / 0.2918 / 0.142 / 0.5882 / 0.5151 / 0.882 / 0.0746 / 0.0488 / 0

Usually, 2-4 points are taken off for incorrectly updated distance matrices.

Also, points for problems with the clustering process are taken off here (usually 2 points).

Output clustering result (6 pts)

Points were awarded pretty liberally for this – even if the clustering results were wrong, full credit could still be awarded if the data outputted contained the correct number of clusters and samples. Points were deducted under the following situations:

-5 if no output files were submitted (even if the program can output clusters)
-1 if the format was incorrect (note: points were not deducted if the samples were identified as “sample1, sample2, … etc.”)
-4 if both output files submitted were different from the output files generated by the program, -2 if just one of the output files was different. This is usually due to incorrect command line arguments or output from an earlier version of the submitted script.

Additional point penalties independent of the components of the program are applied on top of the individual component scores, as follows:

-6 if clustering was performed by genes (data rows) instead of by samples (arrays, or data columns). The size of the penalty reflects the importance of the whole point of the exercise, which should have been made clear by the reading of the paper and by the answer to 1.1.
-2 if the program didn’t compile. It is much more difficult to award partial credit for programs that don’t compile.
-3 if the program doesn’t take command line arguments. The fact that a penalty will be levied for such cases is clearly stated in the directions.

(the lowest possible score, however, will not be below 0/20.)

Small bugs usually have 1-2 points taken off and are considered part of the individual component scores.

1.3 A cluster analysis and visualizationsoftware originally written by Michael Eisen and updated by Michiel de Hoon can be downloaded here. Please read the manual and then use it to do hierarchical clustering on the same dataset using the correlation coefficient (uncentered) distance metric and the centroid-linkage clustering method. Provide the clustering results you obtain from the software (submit the .cdt and .atr files, and name them as FirstnameLastname_ps4_TFinitials.cdt and FirstnameLastname_ps4_TFinitials.atr.). (5 pts)

See PS4_answer_2003.cdt and PS4_answer_2003.atr (use the TreeView program embedded in Cluster 3.0 to check the clustering results).

Missing or misnamed files have 1-3 points taken off. Incorrect output is a 3 point penalty.

1.4 This clustering software also offers several other clustering methods. Now use this software to analyze the same dataset with the k-means algorithm. (6 pts total)

1.4.1 How many clusters do you want to use? Explain how you decided on this number according to the original paper (in no more than 20 words). (2 pts)

Six clusters. The 85 breast samples were classified into six groups by hierarchical clustering, using the intrinsic gene set of 456 cDNA clones.

1.4.2 Execute the clustering algorithm three times, using 10, 100, 1000 as the number of runs, respectively.Compare the results. Do you get the same results with different executions? Give the reason why this is happening (in one sentence). [Hint: read page 18 to 19 in the manual.] (2 pts)

No. The initial assignment of items to clusters is random and hence different runs of the k-means clustering algorithm may not give the same final clustering solution; the solution with the smallest sum of within-cluster distances is reported, which varies from one execution to another for this problem.

Many people forgot to mention that the initial assignment was random (1 point penalty).

1.4.3 Provide the clustering results you obtain in one execution (using 1000 as the number of runs), provide the number of time(s) the solution is found and paste below the contents of the .kag file. (2 pts)

(Note: the solution is not unique and following is an example.)

Number of times the solution was found: 1.

Clustering results in the .kag file:

ARRAYGROUP

BC807A-BE0

BC116A-BE0

BC118B-BE0

BC121B-BE0

BC124A-BE0

BC201B-BE0

BC104A-BE0

BC210B-AF0

BC308B-BE0

BC106B-BE0

BC405A-BE0

BC503B-BE0

BC108A-BE0

BC608B-BE0

BC610A-BE0

BC702B-BE0

BC704B-AF0

BC-HBC30

BC-HBC4-T10

BC240

BC7901

BC709B-BE1

BC114A-BE1

BC123B-BE1

BC206A-BE1

BC213B-BE1

BC305A-BE1

BC309A-BE1

BC406A-2ndTUMOR1

BC605B-BE1

BC-HBC61

BC21

BC441

BC451

NormBrst12

NormBrst22

NorBreast32

BC110B-BE2

BC708B-BE2

BC711B-BE2

BC808A-BE2

BC117A-BE2

BC307B-BE2

BC402B-BE2

BC406A-BE2

BC-HBC22

NorwNorBst12

BC11-FA2

BC172

BC37-FA2

BC12573

BC125A-BE3

BC706A-BE3

BC183

BC20-FA3

BC31-03

BC35-03

BC383

BC4-LN43

BC403

BC63

BC710A-BE4

BC111A-BE4

BC112B-BE4

BC115B-BE4

BC111B-BE4

BC120A-BE4

BC214B-BE4

BC303B-BE4

BC107B-BE4

BC713A-BE4

BC164

BC-A4

BC13695

BC805A-BE5

BC119A-BE5

BC205B-AF5

BC208A-BE5

BC404B-BE5

BC606B-AF5

BC-HBC55

BC145

BC235

BC46-LN465

BC48-05

1.5 How would you determine mathematically the “goodness” of a cluster? How about biologically? Each in one single sentence please. (4 pts)

Mathematically: small within-cluster variation, large between-cluster distance. (2pts)

Many people mentioned small within-cluster variation but forgot to mention large between-cluster distance (1 point penalty).

Biologically: members of clusters share biological functions, share motifs (for genes) or share biological characteristics (for samples). (2pts)

2: Motif searching and functional enrichment (30 pts total)

You will need to read the following paper by Tavazoie et al. to answer the next part: Tavazoie et al., Systematic determination of genetic network structure. Nature Genetics 22:281-5.

2.1 Read the Tavazoie et al. paper, and answer the following questions. (14 pts total)

2.1.1 With reference to Table 1, what is the most likely function of a gene that appears in cluster 7? (2 pts)

Table 1 shows that cell-cycle control and mitosis is the most common functional category in cluster 7 with 312 total ORFs. Thus, this is the most likely functional category for an unknown gene in this cluster.

2.1.2 At what stage in the cell cycle do you think a gene in cluster 7 would have its peak expression levels? (2 pts)

Figure 1 illustrates that cluster 7 has its peak expression at M phase.

2.1.3 The periodicity index of a cluster is a measure of how close the expression profile of that cluster is to a pure frequency of 0.0125 min-1. In no more than 30 words, explain the significance of this frequency and how the authors determined its value. (4 pts)

The authors wanted a frequency that matched the cell-cycle. Two well-studied periodic transcripts, CLN1 and CLN2, are known to have cycles of about 80 minutes, or a frequency of 0.0125 min-1.

2.1.4 Define the term “false positive” in one sentence. (2 pts)

A false positive is when a test incorrectly identifies a case as being significant.

2.1.5 In this study, 199 MIPS functional categories were tested for each cluster, and a p-value threshold of 3x10-4 was used to determine which functional categories were highly enriched. If a p-value threshold of 0.05 were used instead, about how many functional categories would you expect to be incorrectly labeled as being highly enriched? Solve this problem using the numbers given here. You do not need the actual data. (4 pts)

199 * 0.05 = 10

You might find the following paper by Hughes et al. helpful when answering the next part.

Hughes et al., Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol. 296: 1205-1214.

2.2 Tavazoie et al. used the program AlignACE to identify over-represented motifs in each cluster. The program can be accessed at the following site: Here you will use it to analyze the upstream regions of genes present in cluster #14. The full data set is available at However, you will only need the gene names from cluster #14, which are listed below. Note that it might take several minutes for the program to run. (10 pts total)

YAR035W

YBL002w

YBL009w

YBL034c

YBL063w

YBR084w

YBR161w

YBR243c

YCLX08c

YCR075c

YCR086w

YDL211c

YDR091C

YDR113c

YDR247W

YDR297w

YDR356w

YDR499W

YEL003w

YEL017w

YEL018w

YEL061c

YER001w

YER003c

YER011w

YER016w

YER018c

YER047c

YER129w

YGL093W

YGR014W

YGR055W

YGR098C

YGR099W

YGR140W

YHR061C

YHR172W

YHR173C

YIL169C

YJL092W

YJL118W

YJL176C

YJL201W

YJR112W

YKL029C

YKL032C

YKL037W

YKL092C

YKL127W

YKR010C

YLL021w

YLR045c

YLR209C

YML053C

YMR003W

YMR056C

YMR136W

YMR144W

YNL087W

YNL088W

YNL112W

YNL126W

YNR009W

YOL155C

YOR175C

YOR188W

YOR195W

YOR247W

YOR290C

YOR309C

YPL016W

YPL127C

YPR106W

YPR141C

2.2.1. AlignACE lists its results in the order of decreasing MAP score. Thus, Motif 1 is the best motif in terms of MAP score. Paste below the output for Motif 1 and Motif 2 when you run AlignACE using the gene names from cluster #14. (5 pts)

Answers may vary:

Motif 1

AAAAAAAAAAAA 2 78 0

AAAAGAAAAAAA 5 272 0

AAAAAAAAAAGA 30 85 1

AAAAGAAAAATA 31 103 1

AAAAAAAAAAAA 11 368 1

AAAATAAAAACA 31 333 1

AAAAAAAAAACA 33 132 1

AAAAGAAAAACA 12 143 0

AAAATAAAAATA 35 176 0

AAAATAAAAATA 13 172 0

AAAAGAAAAATA 38 74 0

AAAACAAAAAAA 13 375 0

AAAAAAAAAAGA 39 349 0

AAAAAAAAAAAA 13 474 1

AAAAGAAAAAAA 41 223 1

AAAAAAAAAAAA 41 289 1

AAAAAAAAAAAA 42 98 1

AAAACAAAAAAA 59 111 1

AAAACAAAAACA 59 99 1

AAAAAAAAAAAA 57 263 0

AAAAAAAAAAAA 6 88 1

AAAAAAAAAAAA 57 251 0

AAAAAAAAAAAA 42 110 1

AAAATAAAAAAA 47 213 0

AAAACAAAAACA 48 131 0

AAAACAAAAACA 48 143 0

AAAACAAAAACA 49 129 1

AAAAAAAAAAGA 16 527 1

AAAATAAAAAAA 53 373 0

AAAAAAAAAAGA 56 327 0

AAAAAAAAAAAA 17 114 1

AAAAAAAAAAAA 61 342 0

AAAAAAAAAACA 64 786 0

AAAAGAAAAAAA 65 749 1

AAAAGAAAAAAA 20 30 0

AAAAAAAAAAAA 20 139 0

AAAAAAAAAATA 4 185 1

AAAAAAAAAAAA 6 100 1

AAAAAAAAAAGA 21 7 1

AAAACAAAAACA 25 137 0

AAAACAAAAACA 25 149 0

AAAAGAAAAAAA 29 291 0

AAAATGAAAAGA 54 24 0

AAAAAGAAAAGA 56 379 1

AAAAAGAAAAAA 65 534 1

AAAATGAAAAAA 65 709 1

AAAACGAAAACA 20 261 0

AAAACGAAAAGA 27 463 1

AAAAGGAAAATA 7 295 1

AAAAAAAAAGCA 35 64 1

AAAAAAAAAGAA 47 346 0

AAAAAAAAAGGA 55 396 1

AAAAGAAAAGCA 16 280 0

AAAAGAAAAGGA 64 603 0

AAAAAAAAAGAA 65 435 0

AAAAGAAAAGAA 47 724 1

AAAAAAAAAGGA 27 406 0

CAAACAAAAACA 31 278 0

CAAAAAAAAACA 64 239 1

CAAAAAAAAATA 51 83 0

CAAAGAAAAATA 62 271 1

CAAATAAAAAAA 28 86 1

AAAACAAGAACA 38 215 0

AAAAAAAGAATA 19 90 0

AAAACAAGAACA 51 54 0

AAAACAAGAACA 6 250 0

GAAAGAAAAAAA 69 288 1

GAAATAAAAACA 36 422 0

GAAAAAAAAAAA 38 487 1

GAAATAAAAAAA 57 237 0

GAAATAAAAACA 6 337 0

AAAGGAAAAATA 52 227 0

AAAAAGAAAGGA 67 466 0

AAAAAGAAAGTA 34 39 0

AAAATGAAAGGA 51 137 1

CAAACGAAAAAA 59 133 1

CAAAGAAAAGGA 49 96 0

CAAAGAAAAGAA 1 448 1

CAAACAAAAGCA 8 452 1

AAAAGAAAAAAG 13 486 1

AAAATAAAAAGG 65 865 1

AAAAAAAAAATG 41 186 0

GAAAAGAAAATA 14 484 1

GAAACGAAAATA 3 48 1

AAAATAAGAGAA 13 460 1

AAAAAAAGAGTA 10 231 1

AAAAAAAGAGTA 27 58 0

AAAGTGAAAACA 53 446 0

AAAGCGAAAATA 2 142 1

AAAGAGAAAAAA 18 79 1

AAAGGGAAAAAA 25 229 1

CAAAGAAGAAAA 63 52 0

CAAACAAGAAGA 32 369 0

CAAACAAGAAAA 32 476 1

CAAAGAAGAAGA 4 478 0

CAAGAAAAAAGA 50 50 1

AAAGGAAGAAGA 47 471 1

AAAGAAAGAAAA 66 418 0

CAAAAGAAAGCA 22 215 1

GAAGAAAAAAAA 26 62 1

GAAGAAAAAACA 42 4 1

AAAACAAAAGTG 18 675 1

AAAATAAAAGCG 32 449 0

AAAATGAGAGAA 64 704 0

AAAGAGAAAGGA 18 54 1

AAAGGGAAAGAA 26 420 0

AAAGAGAAAGCA 50 163 1

**** ***** *

MAP Score: 161.157

Specificity score: 7.6e-04

Motif2
TGCTGATGCTGATGC 0 239 0
TGCCGCTGCTGCTGC 14 21 1
TGTTGACGTTGGTGC 13 82 1
TGTAGGCGCTGTTGC 36 349 0
TGTTGTGGTTGTTGC 66 81 1
TGCAGCTGTTGGTGG 22 373 1
TGTGGTGGCGGATGC 30 26 0
TGTTGCTGGTGTTGG 13 349 1
TGCCGCTGCTACTGC 14 36 1
TGCAGTTGTTATTGC 53 184 1
TGTTGTTGATGCTAC 55 332 0
TGCAGTTGTGGGTAC 43 37 1
TGCAGAAGGTTTTGC 8 172 0
TGTTGCTGCTATTGG 4 91 0
TGTTGCAGCTAATAC 19 278 1
TGTTGATGAGGTTAG 68 424 0
TGCGCCGGGTGATGC 47 488 1
**********
MAPScore:10.022
Specificity score: 4.1e-11
No matching motifs.

2.2.2. The article by Hughes et al. suggests significance thresholds for the MAP and group specificity scores of >=10 and <=10-10, respectively. From your output, which motif(s) have both significant MAP and group specificity scores? Choose one of the following: (a) Motif 1 only, (b) Motif 2 only, (c) both Motif 1 and Motif 2, or (d) neither. (2 pts)

Answers may vary:

(b) Motif 1 has the highest MAP score, but its specificity score is not significant. Motif 2 has a lower MAP score, but both its MAP score and specificity score are significant.

2.2.3. Suppose a motif has a very high MAP score, but the group specificity score is not significant. What does it tell you? Choose one of the following: (a) the motif is common throughout all regions of the genome, (b) the motif is common only within the tested cluster, or (c) the motif is common throughout the genome and even more so within the tested cluster. (3 pts)

(a) The high MAP score tells you that the motif appears more often than it should by chance. However, the low group specificity score tells you that this motif is no more frequent in this cluster than it is in other clusters. Thus, the motif is common throughout the genome.

You might find the following URL helpful in answering these questions: .

2.3 Sequence Logos as visual representations of motifs. (6 pts total)

2.3.1 What can you conclude if the height of an “A” in a motif is 2 bits tall (in one sentence)? (3 pts)

The “A” is 100% conserved at that position in sites represented by the logo.

2.3.2 What can you conclude if the total height of a sequence logo at a particular position is close to zero (in one sentence)? (3 pts)

There is much variability at that position in the sites represented by the logo. So, the particular base at that position might not be biologically important.

3: Markov Chains and Hidden Markov Models (33 pts total)

3.1 In a Hidden Markov Model (HMM), you do not know which states were used to generate a given outcome. However, you can calculate the probability that the outcome came from a specific state. In this problem, you will create a HMM to predict whether it is more likely that a given sequence came entirely from a CpG island or from a non-CpG island. Use the data given below to construct the HMM. (24 pts total)

The four sequences below came from a CpG island:

CCGCTC

CGAGCG

GTCGCC

CGCCAC

These next four sequences came from a non-CpG island:

GTACGA

AGCACG

GAAGCA

TCCAGC

3.1.1 The HMM will contain 8 states--four corresponding to CpG islands and four corresponding to non-CpG islands. List the 8 states in this HMM. (4 pts)

A, C, G, and T for CpG islands, and A, C, G, and T for non-CpG islands. This can be written as A+, C+, G+, T+, A-, C-, G-, and T-.

3.1.2 In addition to the states, the HMM includes several “initial” probabilities. Start by calculating the probabilities for the first nucleotide in a sequence. Four probabilities have already been calculated for you. For example, 3 of the 8 sequences came from a CpG island and begin with a C. Therefore, we’ll write P(0 -> C+) = 3/8. Find the remaining for initial probabilities. (4 pts)

CpG island:non-CpG island:

P(0 -> A+) = 0P(0 -> A-) = 1/8

P(0 -> C+) = 3/8P(0 -> C-) = 0

P(0 -> G+) = 1/8P(0 -> G-) = 2/8 = 1/4

P(0 -> T+) = 0P(0 -> T-) = 1/8

3.1.3 There are 2 sets of 16 transition probabilities in this HMM. An example would be P(A+ -> T+), which is the probability of an A in a CpG island followed by a T in a CpG island. P(A- -> T-) would be the probability of an A followed by a T in a non-CpG island. Note that we are not considering transitions between a CpG island state and a non-CpG island states. Below, four transition probabilities have been calculated for you. For example, in the CpG island sequences, a C is followed by another base 10 times. One of these times the subsequent base was an A. Therefore, P(C+ -> A+) = 1/10. Calculate the four remaining transition probabilities that are listed. (4 pts)

P(C+ -> A+) = 1/10P(C- -> A-) = 3/6

P(C+ -> C+) = 3/10P(C- -> C-) = 1/6

P(C+ -> G+) = 5/10 = 1/2P(C- -> G-) = 2/6 = 1/3

P(G+ -> C+) = 4/6 = 2/3P(G- -> C-) = 3/6 = 1/2

3.1.4 Now, use the results from 3.1.2 and 3.1.3 to calculate the probability that GCCGCA is part of a CpG island. Also, calculate the probability that it is part of a non-CpG island. Show your work. (8 pts)

P(CpG island) = Pb(G+)*P(G+ -> C+)*P(C+ -> C+)*P(C+ -> G+)*P(G+ -> C+)*P(C+ -> A+)

= (1/8)*(4/6)*(3/10)*(5/10)*(4/6)*(1/10) = 240/288000 = 1/1200 = 0.000833