Biotechnology Homework 4 Fall 2010 Due on Oct. 13th

1. Imagine you have an ordered library of mouse genomic BAC clones (of roughly 200-250 kb each), named BAC-1, BAC-2 etc. (“1”, “2” etc. in the table below). You also have a set of STS markers, named a, b, c, d etc. You are trying to order the clones by STS content mapping. The table below presents all of the BAC clones that include STS-a. It is somewhat realistic that a particular sequence might be found in at least 8 clones because the library has to have many copies of any one region to provide good coverage. The presence (+) or absence (-) of an STS in a particular BAC clone (which could be determined experimentally by hybridization or by producing a PCR product from a particular BAC as template using primers for a specific STS) is indicated.

BAC   a  b  c  d  e  f  g  h
1     +  +  -  +  +  -  -  -
2     +  +  -  +  -  -  -  -
3     +  -  -  -  +  +  -  +
4     +  -  +  -  +  +  -  +
5     +  +  -  +  +  -  -  -
6     +  +  -  +  -  -  +  -
7     +  -  -  +  +  +  -  -
8     +  -  -  -  +  +  -  -

(i) Draw the overlap of the named BAC clones that you can deduce from the above data. Explain your method of getting the results (otherwise I cannot tell how you thought about the problem & will not give you full credit; if you have a logical method you can imagine writing a computer program to implement it on a larger scale than here). [2]

(ii) The STS markers you use in such a mapping project would essentially be random, unique sequences (you would know nothing about their positions). Does the mapping process in this example tell you the exact ORDER of STS markers AND does it tell you their DISTANCE apart? [1]

(iii) I only tabulated the BACs containing STS-a and only included STS markers that were positive in at least one of the BACs shown. In the real, entire mapping experiment there would be thousands of BACs and each would be tested for thousands of STS markers. How would you extend your map to the left or the right using the additional information that would be available from the entire mapping experiment? (Do not just say "use a computer"; you have to tell the computer what to look for first.) [1]

(iv) An STS is supposed to be a unique sequence (not a repeated sequence).

(a) Before you use an STS in the mapping experiment described here how could you find out if it is unique? Describe TWO ways, commenting on which you think is better or more efficient. Remember, we do not know the complete genome sequence for the genomic clones we are mapping. [2]

(b) If you mistakenly used an STS in the above type of mapping experiment (using the STS as a hybridization probe) what results would suggest that the “STS” hybridization probe actually contains repeated sequences? [1]

(v) Imagine your BAC library is half as big, so you only have data for BACs 1, 2, 3 and 4. Draw a map & indicate any uncertainties.

Imagine you have BACs 1-8 but you only tested STSs d, e, g and h. Draw a map and indicate any uncertainties.

What is your general conclusion about whether this type of mapping requires a critical concentration of BACs or markers? [2]
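Question 1(i) mentions that a logical method could be turned into a computer program. As a minimal sketch of the bookkeeping only (not a full ordering method), the STS table from question 1 can be encoded and queried as below. The function names and the data layout are illustrative assumptions, not part of the assignment.

```python
from itertools import combinations

MARKERS = "abcdefgh"

# Rows copied from the STS table in question 1 (+ present, - absent).
ROWS = {
    1: "++-++---",
    2: "++-+----",
    3: "+---++-+",
    4: "+-+-++-+",
    5: "++-++---",
    6: "++-+--+-",
    7: "+--+++--",
    8: "+---++--",
}

def sts_content(rows, markers=MARKERS):
    """Map each BAC to the set of STS markers it contains."""
    return {bac: {m for m, sign in zip(markers, row) if sign == "+"}
            for bac, row in rows.items()}

def clones_per_marker(content):
    """Invert the table: map each STS marker to the set of BACs carrying it."""
    by_marker = {}
    for bac, markers in content.items():
        for m in markers:
            by_marker.setdefault(m, set()).add(bac)
    return by_marker

def shared_markers(content):
    """For every pair of BACs, list the STS markers they have in common."""
    return {(x, y): content[x] & content[y]
            for x, y in combinations(sorted(content), 2)}

content = sts_content(ROWS)
by_marker = clones_per_marker(content)
shared = shared_markers(content)

for marker in MARKERS:
    print(marker, sorted(by_marker[marker]))
```

Printing the marker-to-clone lists is only a starting point; deciding on an order that keeps every marker's clones contiguous is the reasoning the question asks you to explain.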

2. BAC clones can also be ordered by restriction enzyme fingerprinting. A typical BAC may be around 200kb and EcoRI may cut on average roughly once every 4kb. Hence, a typical BAC EcoRI digest may contain 40-50 bands. The accuracy with which band sizes can be measured is very important for this type of mapping. Real measurements are likely accurate to better than the nearest 50bp, but for convenience in this question let us assume that recording a band (on a computer file) as 2.1kb means it could be anywhere from 2,050bp to 2,149bp. Also, band intensities are measured, so if a particular size (say 3.2kb) has a band that is three times as strong as expected it is recorded as 3.2x3. It is very important to realize that the size of a band is not accurate enough to identify two bands as being identical. Two bands recorded as having the same size will very often come from different parts of the genome.
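The arithmetic in this paragraph can be checked with a short sketch. The 4kb spacing is consistent with a 6-bp recognition site in an idealized random sequence with equal base frequencies (an assumption, since real genomes are not random), and the rounding rule below simply implements the recording convention stated above. The function names are illustrative.

```python
def expected_band_count(bac_bp, mean_spacing_bp):
    """Rough number of fragments: clone length divided by mean cut spacing."""
    return bac_bp / mean_spacing_bp

def record_band_kb(size_bp):
    """Record a band to the nearest 0.1kb, so 2,050-2,149bp all become 2.1."""
    tenths_of_kb = (size_bp + 50) // 100
    return tenths_of_kb / 10

# Idealized spacing for a 6-bp site: one site per 4**6 bp on average.
print(4 ** 6)                                 # 4096
print(expected_band_count(200_000, 4_000))    # 50.0
print(record_band_kb(2_050), record_band_kb(2_149), record_band_kb(2_150))
```

Because a 100bp window is collapsed into one recorded value, two unrelated fragments can easily be recorded with the same size, which is the point made at the end of the paragraph.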

(i) In question 1 we were able to focus on a small subset of data by looking only at BACs with the STS-a. The first step here is not so obvious. That is an important point. As a first idea (but not necessarily a good one) we might start by picking out a particular signature of EcoRI band sizes that we choose randomly from the digest of a single BAC clone. Here we will take BAC-1 and pick out five bands: 15.1, 8.3, 4.3, 2.3 and 1.7kb. We will then look for all BACs that include that signature. That signature will be quite common (because size resolution is not exact), so among 200,000 BAC clones (roughly the number likely to be used) there may be a few hundred with that signature. I am picking out just four BACs (randomly) from many more with this signature (it is important to realize that this is just a small sample of those BACs) and I am displaying the EcoRI fragments from just those BACs (BAC-1 to BAC-5 altogether). I am also saving some paper by pretending that the BACs are a little shorter than normal and produce fewer restriction fragments than normal.

BAC-1 BAC-2 BAC-3 BAC-4 BAC-5 BAC-1 BAC-2 BAC-3 BAC-4 BAC-5

15.1 15.1 15.1 15.1 15.1 4.7

14.0 14.0 4.6

13.7

13.3 4.5

13.0 4.4 4.4

12.6 12.6 4.3 4.3 4.3 4.3 4.3

12.4 3.9x2

11.9 3.8

11.5 11.5 3.7x2

11.2 11.2 3.6 3.6

9.4 3.5x3 3.5

9.1 2.8x2 2.8x2

8.3 8.3 8.3 8.3 8.3 2.6 2.6

7.9 7.9

7.7 7.7 2.4x2 2.4x2

7.6 2.3 2.3x2 2.3 2.3 2.3

7.3 2.2x2 2.2

7.1 2.1 2.1x2 2.1

6.9 2.0

6.5 1.9x3

6.2 6.2 1.8 1.8x4

6.1 1.7x2 1.7 1.7x2 1.7 1.7

6.1 1.6

6.0 1.5 1.5

5.8 1.4x4

5.8 1.3x2

5.6

5.5 1.2 1.2

5.3 1.1x3 1.1 1.1

5.2 5.2

5.1 5.1 1.0 1.0

4.9 4.9 0.9

4.8 0.8 0.8x2
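The signature search described before the table can be sketched as a multiset comparison. This is only an illustration: the five signature sizes come from the question, but the two toy fingerprints are hypothetical (they are not columns of the table), sizes are stored as integer tenths of a kb to avoid floating-point comparisons, and the function names are assumptions.

```python
from collections import Counter

# Signature from the question: 15.1, 8.3, 4.3, 2.3 and 1.7kb, in tenths of a kb.
SIGNATURE = Counter([151, 83, 43, 23, 17])

def to_fingerprint(bands):
    """Build a fingerprint from (size_in_tenths_of_kb, multiplicity) pairs.

    A band recorded as '2.3x2' would be entered as (23, 2).
    """
    fingerprint = Counter()
    for size, multiplicity in bands:
        fingerprint[size] += multiplicity
    return fingerprint

def contains_signature(fingerprint, signature=SIGNATURE):
    """True if every signature band is present at least as often as required."""
    return all(fingerprint[size] >= n for size, n in signature.items())

def shared_band_count(fp_a, fp_b):
    """Number of recorded band sizes in common, respecting multiplicities."""
    return sum((fp_a & fp_b).values())

# Hypothetical toy fingerprints, for illustration only.
toy_a = to_fingerprint([(151, 1), (83, 1), (43, 1), (23, 2), (17, 1), (62, 1)])
toy_b = to_fingerprint([(151, 1), (83, 1), (43, 1), (23, 1), (99, 1)])

print(contains_signature(toy_a), contains_signature(toy_b))
print(shared_band_count(toy_a, toy_b))
```

A high shared-band count is suggestive rather than conclusive, because bands recorded with the same size need not be the same fragment.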

(i) From the data and background information provided can you be (reasonably) certain that BAC-1 derives from the same region of the genome as (i.e. does it genuinely overlap) any of the four other BACs (which ones, if any)? [1]

(ii) If you wanted to start a map with BAC-1 and build a contig of overlapping BACs from that starting point, using all of the restriction fingerprinting data from the whole project (not just what I listed), how would you proceed? [1]

(iii) In Q1 you explored whether STS mapping required a minimal number of BACs in a library in order to generate a map of overlaps. Do you think that restriction fragment fingerprinting requires a minimal BAC library size? [1]

(iv) Imagine using the same (adequately large) set of BACs successfully either for restriction enzyme fingerprinting or for STS content mapping to produce a map. The most important aspect of a map is that it is accurate. Explain why the restriction enzyme map is more useful than the STS map:

(a) to detect any clones that had undergone a small deletion event during cloning or amplification [1]

(b) to see if the deduced map of clones is actually accurate with regard to intact genomic DNA in a cell [1]

(c) to verify the DNA sequence of the whole genome once established by sequencing individual BACs or by whole genome shotgun sequencing. [1]

Please start a new page in your answers

3. Imagine you have a purified 200kb BAC that you are going to sequence by a shotgun method.

You sonicate the DNA to a measured degree and gel purify fragments of (let’s say here) about 3kb and then clone these fragments into a plasmid vector to produce a library of clones. You then isolate DNA from each clone and sequence from each end of the insert (by using different primers complementary to vector sequences) for each DNA. This generates many DNA sequences of roughly, say, 700nt. If there are enough such sequences, few errors and no awkward segments of DNA (such as repeats), a computer program can align pairs of sequences and progressively build up large segments of sequence from these merges, hopefully spanning the whole 200kb.
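To get a feel for what "enough such sequences" might mean, here is a rough sketch of the read-count arithmetic. The 200kb and 700nt figures come from the text; the coverage depths tried below are arbitrary assumptions, and the uncovered-fraction estimate uses the idealized Poisson (Lander-Waterman style) approximation, which assumes randomly placed, error-free reads and ignores the overlap length needed to merge them.

```python
import math

BAC_LENGTH = 200_000    # bp, from the question
READ_LENGTH = 700       # nt per read, from the question
READS_PER_CLONE = 2     # one read from each end of the insert

def reads_needed(coverage):
    """Reads required so that total read length is coverage x BAC length."""
    return math.ceil(coverage * BAC_LENGTH / READ_LENGTH)

def expected_uncovered_fraction(coverage):
    """Idealized fraction of bases hit by no read: exp(-coverage)."""
    return math.exp(-coverage)

for depth in (1, 4, 8):  # assumed depths, chosen only for illustration
    n_reads = reads_needed(depth)
    n_clones = math.ceil(n_reads / READS_PER_CLONE)
    missed_bp = expected_uncovered_fraction(depth) * BAC_LENGTH
    print(f"{depth}x: {n_reads} reads, {n_clones} clones, "
          f"about {missed_bp:.0f} bp expected uncovered")
```

Under these assumptions even several-fold coverage leaves some bases unsampled, which is one reason gaps between merged sequences are expected.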

(i) Why is sonication used instead of a partial Sau3A digest for shotgun sequencing, even though either partial Sau3A digestion or sonication would be good enough to make a BAC library covering a whole genome? [1]

(ii) When making a BAC library from a genome it is very important not to make composite clones (two inserts ligated together). In generating the 3kb insert plasmid library for shotgun sequencing (in just the way I described above) what would be the consequence of using a plasmid that had two 3kb inserts? [1]

(iii) To build up a sequence of the BAC do we need to determine experimentally the complete sequence of each 3kb plasmid clone we make? Explain. [1]

(iv) Below I am writing down some sequences you might obtain in such an experiment, but I am simplifying in a couple of ways. First, I am only giving you a small subset of the sequences (picked out to include several overlaps). Second, I am writing all of the sequences in the same sense. In reality, about half of the sequences would actually have been the complement of what I am writing (in fact I label those as “comp”). In a real sequencing project a program would deal with sequence quality, repeats and the best way to merge sequences but here I am just separating out those processes by asking you to merge sequences with a little help from a computer program that simply finds matches between sequences.

You may be very good at spotting sequence identities and you could conceivably do this whole question without a computer but it is easiest if you go to the NCBI web site and use the “Align” tool within “Blast”. So, go to http://www.ncbi.nlm.nih.gov/ and then go to BLAST on the upper menu bar. Then scroll down to the bottom and click on Align (http://www.ncbi.nlm.nih.gov/BLAST/bl2seq/wblast2.cgi). Leave all of the default parameters unchanged and simply paste in sequences from the question into each of the two boxes for pair-wise comparisons and hit “BLAST”. The output will be diagrammatic followed by an alignment of the identical or near-identical sequences with nucleotide numbers from the original two sequences.
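For readers curious about what a merge involves, the two basic operations can be sketched in a few lines: taking the reverse complement of a read, and finding an exact end-to-start overlap between two reads. This is a toy illustration with made-up short strings, not what BLAST does (BLAST finds local alignments and tolerates mismatches and gaps); the function names and the `min_len` threshold are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest exact match between the end of a and the start of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Merge b onto the end of a if they overlap; otherwise return None."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:] if k else None

# Toy strings, for illustration only (not taken from the sequences below).
left, right = "ACGTTGCA", "TGCAGGTC"
print(reverse_complement("ATGC"))            # GCAT
print(suffix_prefix_overlap(left, right))    # 4
print(merge(left, right))                    # ACGTTGCAGGTC
```

Exact matching like this breaks down as soon as a read contains an error, which is why real assemblers score approximate alignments and take read quality into account.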

Seq1-comp

ATGAACCGCTACGCGGTAAGCTCGATGGTGGGGCAAGGATCCTTCGGGTG

CGTATACAAGGCGACACGCAAGGACGACAGCAAGGTGGTGGCCATCAAAG

TGATCTCCAAGCGCGGAAGAGCCACGAAAGAGCTGAAGAATTTGCGCAGG

GAGTGCGACATTCAGGCCCGGCTGAAGCATCCGCACGTCATCGAGATGAT

CGAGTCCTTCGAGTCGAAGACGGACCTTTTCGTGGTCACTGAGTTCGCGC

TGATGGACCTGCACCGCTACCTGTCCTACAATGGAGCCATGGGCGAGGAG

CCGGCACGTCGGGTGACCGGGCATCTGGTGTCCGCTCTGTACTACCTGCA

TTCAAACCGCATCCTCCACCGGGATCTCAAACCGCAAAACGTGCTGCTCG

ACAAGAACATGCACGCGAAACTCTGCGACTTTGGACTGGCCCGCAACATG

ACCCTGGGTACCCACGTGCTCACCTCGATCAAGGGAACGCCCCTCTACAT

GGCCCCGGAGCTGCTGGCGGAGCAGCCGTACGACCATCATGCGGACATGT

Seq2

TGATCTCCAAGCGCGGAAGAGCCACGAAAGAGCTGAAGAATTTGCGCAGG

GAGTGCGACATTCAGGCCCGGCTGAAGCATCCGCACGTCATCGAGATGAT

CGAGTCCTTCGAGTCGAAGACGGACCTTTTCGTGGTCACTGAGTTCGCGC

TGATGGACCTGCACCGCTACCTGTCCTACAATGGAGCCATGGGCGAGGAG

CCGGCACGTCGGGTGACCGGGCATCTGGTGTCCGCTCTGTACTACCTGCA

TTCAAACCGCATCCTCCACCGGGATCTCAAACCGCAAAACGTGCTGCTCG

ACAAGAACATGCACGCGAAACTCTGCGACTTTGGACTGGCCCGCAACATG

ACCCTGGGTACCCACGTGCTCACCTCGATCAAGGGAACGCCCCTCTACAT

GGCCCCGGAGCTGCTGGCGGAGCAGCCGTACGACCATCATGCGGACATGT

GGTCACTGGGCTGCATAGCCTACGAAAGCATGGCCGGTCAGCCGCCCTTC

TGTGCCAGCTCCATCCTGCATCTGGTGAAGATGATCAAGCACGAGGACGT

CAAGTGGCCGAGCACGCTGACTAGCGAGTGCCGCTCCTTCCTACAGGGCC

TGCTTGAGAAGGA

Seq3

GGTCACTGGGCTGCATAGCCTACGAAAGCATGGCCGGTCAGCCGCCCTTC

TGTGCCAGCTCCATCCTGCATCTGGTGAAGATGATCAAGCACGAGGACGT

CAAGTGGCCGAGCACGCTGACTAGCGAGTGCCGCTCCTTCCTACAGGGCC

TGCTTGAGAAGGACCCCGGTCTGCGCATATCCTGGACGCAGCTGTTGTGT

CACCCCTTCGTTGAGGGACGCATCTTTATCGCAGAAACGCAGGCGGAGGC

GGCCAAGGAATCGCCTTTCACAAATCCCGAAGCCAAGGTTAAGTCGTCAA

AACAGTCCGATCCGGAGGTAGGCGATCTGGACGAGGCCCTGGCCGCTTTG

GACTTTGGCGAGTCGCGACAGGAAAACTTGACCACCTCCCGCGACAGCAT

AAACGCCATTGCTCCCAGCGATGTTGAGCATCTTGAGACCGATGTGGAGG

ACAATATGCAACGTGTGGTCGTTCCCTTCGCGGACTTGTCCTACAGGGAT

CTGTCTGGTGTTCGGGCAATGCCGATGGTACACCAGCCGGTGATCAACTC

Seq4

CCGGCACGTCGGGTGACCGGGCATCTGGTGTCCGCTCTGTACTACCTGCA

TTCAAACCGCATCCTCCACCGGGATCTCAAACCGCAAAACGTGCTGCTCG

ACAAGAACATGCACGCGAAACTCTGCGACTTTGGACTGGCCCGCAACATG

ACCCTGGGTACCCACGTGCTCACCTCGATCAAGGGAACGCCCCTCTACAT

GGCCCCGGAGCTGCTGGCGGAGCAGCCGTACGACCATCATGCGGACATGT

GGTCACTGGGCTGCATAGCCTACGAAAGCATGGCCGGTCAGCCGCCCTTC

TGTGCCAGCTCCATCCTGCATCTGGTGAAGATGATCAAGCACGAGGACGT

CAAGTGGCCGAGCACGCTGACTAGCGAGTGCCGCTCCTTCCTACAGGGCC

TGCTTGAGAAGGACCCCGGTCTGCGCATATCCTGGACGCAGCTGTTGTGT

CACCCCTTCGTTGAGGGACGCATCTTTATCGCAGAAACGCAGGCGGAGGC

GGCCAAGGAATCGCCTTTCACAAATCCCGAAGCCAAGGTTAAGTCGTCAA

AACAGTCCGATCCGGAGGTAGGCGATCTGGACGAGGCCCTGGCCGCTTTG

GACTTTGGCGAGTCGCGAC

Seq5-comp

GCGAACTGGACTCATTGAAACAGCACAACCTGGTGAGCATTATTGTGGCA

CCGCTGCGCAACTCCAAGGCCATTCCACGGGTACTCAAGAGTGTGGCCCA

GTTGCTGTCGTTGCCCTTTGTGCTGGTGGATCCTGTTTTGATTGTTGACC

TCGAGCTCATCCGCAACGTGTACGTGGACGTAAAACTGGTGCCCAATCTC

ATGTACGCCTGCAAGCTGCTCCTGTCGCACAAACAACTCTCGGACTCGGC

TGCCTCCGCCCCACTCACCACGGGTTCGCTCAGTCGAACGTTGCGTAGCA

TTCCGGAGCTAACTGTCGAGGAGCTGGAGACGGCTTGCAGTCTGTACGAA

CTGGTCTGCCACTTGGTACACCTGCAGCAGCAGTTCCTAACGCAGTTCTG

CGATGCGGTTGCCATTCTGGCAGCAAGCGATCTGTTCCTCAACTTCCTCA

CGCACGACTTCAGGCAATCGGATTCAGACGCCGCCTCTGTTCGCCTGGCT

GGGTGCATGTTGGCCCTGATGGGCTGTGTGCTGCGCGAGCTGCCCGAAAA

CGCGGAGCTTGTAGAACGGATTGTCTTTAATCCGCGGCTAAACTTCGTCT

CGCTCCTGCAGAGCCGACACCA

Seq6

TTCCGGAGCTAACTGTCGAGGAGCTGGAGACGGCTTGCAGTCTGTACGAA

CTGGTCTGCCACTTGGTACACCTGCAGCAGCAGTTCCTAACGCAGTTCTG

CGATGCGGTTGCCATTCTGGCAGCAAGCGATCTGTTCCTCAACTTCCTCA

CGCACGACTTCAGGCAATCGGATTCAGACGCCGCCTCTGTTCGCCTGGCT

GGGTGCATGTTGGCCCTGATGGGCTGTGTGCTGCGCGAGCTGCCCGAAAA

CGCGGAGCTTGTAGAACGGATTGTCTTTAATCCGCGGCTAAACTTCGTCT

CGCTCCTGCAGAGCCGACACCACCTGTTGCGGCAACGTTCCTGTCAGCTG

CTGCGCCTGCTGGCCCGCTTCAGCCTGCGCGGCGTGCAGCGCATATGGAA

TGGAGAGCTGCGATTTGCGCTGCAACAACTCTCTGAGCACCACTCGTACC

CGGCACTCCGTGGGGAGGCCGCCCAGACCCTCGACGAGATCAGTCACTTC

ACTTTTTTCGTCACCTAG

(a) Try to align pair-wise combinations of the six sequences and draw a diagram of the merged sequence(s) you obtain and the sequence file (Seq1, Seq2 etc.) overlaps that produce the merged sequence. [2]

(b) It is very important that all sequences are checked and trimmed to be of the highest quality. To see what happens if you work with poor sequences, imagine that you tried to read further in sequence 2 (making guesses at some nucleotide positions) to produce the sequence below:

Poor Seq2

TGATCTCCAAGCGCGGAAGAGCCACGAAAGAGCTGAAGAATTTGCGCAGG

GAGTGCGACATTCAGGCCCGGCTGAAGCATCCGCACGTCATCGAGATGAT

CGAGTCCTTCGAGTCGAAGACGGACCTTTTCGTGGTCACTGAGTTCGCGC

TGATGGACCTGCACCGCTACCTGTCCTACAATGGAGCCATGGGCGAGGAG

CCGGCACGTCGGGTGACCGGGCATCTGGTGTCCGCTCTGTACTACCTGCA

TTCAAACCGCATCCTCCACCGGGATCTCAAACCGCAAAACGTGCTGCTCG

ACAAGAACATGCACGCGAAACTCTGCGACTTTGGACTGGCCCGCAACATG

ACCCTGGGTACCCACGTGCTCACCTCGATCAAGGGAACGCCCCTCTACAT

GGCCCCGGAGCTGCTGGCGGAGCAGCCGTACGACCATCATGCGGACATGT

GGTCACTGGGCTGCATAGCCTACGAAAGCATGGCCGGTCAGCCGCCCTTC

TGTGCCAGCTCCATCCTGCATCTGGTGAAGATGATCAAGCACGAGGACGT

CAAGTGGCCGAGCACGCTGACTAGCGAGTGCCGCTCCTTCCTACAGGGCC

TGCTTGAGAAGGACCCCGGTCTGCGCATATCCTGGACGCAGCTGTTGTGT

CACCCTTCGTTCGAGGACGCATCTTATGCCAGAAAGCAGCGAGGCGCAAG

GAATCGCCTTCACAAATCCGAAGCTAAGGTTAGTCGTCAA

Align Poor Seq2 with other relevant sequences. In cases where there are not perfect matches the merged consensus would now include uncertainty at many nucleotide positions, often represented by “N”. Sequence assembly programs then try to connect the merged sequence with new sequences (where “N” is not helpful in determining if that residue matches). If you were designing the assembly program for this set of sequences (including Poor Seq2, but imagining that you do not know beforehand that this sequence is of poor quality) what feature would you build in to the program to make the assembly proceed as smoothly as possible with the clearest, correct result? [1]

(c) In (a) you should have produced two “contigs” or merges. In a real shotgun project (with far more data than I am showing) how would you expect to be able to join these two contigs into one? (Remember that all of these sequences derive from a single BAC). [1]

(d) Imagine, however, that the two contigs remained unconnected at the end of merging many clones as described above. What experiment could you perform to try to connect these sequences (you should be sure to explain where any key reagents come from)? [1]
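The “N” notation mentioned in part (b) can be illustrated with a small sketch that builds a consensus from two sequences that have already been aligned position by position. It assumes equal-length, gap-free input (a real alignment of a poor read will usually contain insertions and deletions, which this toy does not handle), and the function names are illustrative.

```python
def consensus(aligned_a, aligned_b):
    """Position-by-position consensus; disagreements are written as 'N'."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(x if x == y else "N" for x, y in zip(aligned_a, aligned_b))

def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions at which the two sequences agree."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

# Toy aligned strings, for illustration only.
print(consensus("ACGTAC", "ACCTAC"))        # ACNTAC
print(percent_identity("ACGT", "ACCT"))     # 75.0
```

Each “N” carries no information for later merges, so a consensus with many of them becomes progressively harder to extend.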