Biotechnology Homework 3 Fall 2009 Answers

1. (i) One method is to consider groups of STSs on any one BAC and realize that these must be clustered (with the remaining markers to one side or the other).

Thus, the clusters are

abde

abe

adfh

acdfh

abde

abeg

adef

adf

Pick any pair & start making progress…

abde vs abe means d is to one side of a, b and e

adfh vs abe means d,f, h are to one side of a, and b, e on the other side

abeg vs abde means g and d are either side of abe

adef vs abe means e is closer to a than b

From the above, must be g b e a d [fh]

adef implies f is before h (f is closer to a than h is), and acdfh vs adfh implies c is after h

Hence, g b e a d f h c must be the order. Each BAC's minimal extent is defined by the STSs it includes, and its maximal extent by the nearest STSs that are absent, giving the map below (left vs right has no meaning here, so the mirror-image drawing is equivalent).

BAC   g   b   e   a   d   f   h   c
6   --X---X---X---X--
2       --X---X---X--
1       --X---X---X---X--
5       --X---X---X---X--
7           --X---X---X---X--
8               --X---X---X--
3               --X---X---X---X--
4               --X---X---X---X---X--
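
The deduction in (i) can also be checked by brute force. The Python sketch below uses only the STS content listed above: it tries every marker order and keeps those in which the STSs of each BAC sit next to one another. The function and variable names are illustrative, and exhaustive search is only practical for a handful of markers.

```python
from itertools import permutations

# STS content of BACs 1-8, copied from the cluster list in (i).
BACS = {1: "abde", 2: "abe", 3: "adfh", 4: "acdfh",
        5: "abde", 6: "abeg", 7: "adef", 8: "adf"}


def is_contiguous(order, sts_set):
    """True if the markers of one BAC occupy adjacent positions in `order`."""
    positions = [order.index(m) for m in sts_set]
    return max(positions) - min(positions) == len(positions) - 1


def consistent_orders(bacs):
    """Return every marker order in which each BAC's STSs are clustered."""
    markers = sorted(set("".join(bacs.values())))
    return ["".join(p) for p in permutations(markers)
            if all(is_contiguous(p, s) for s in bacs.values())]


orders = consistent_orders(BACS)
print(orders)
```

Only two orders survive, g b e a d f h c and its mirror image, which agrees with the argument made by hand.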

(ii) In this example there was no ambiguity about STS order (i.e. the mapping process orders the BACs and the STS markers at the same time without requiring any prior knowledge). We have no explicit information about the distance between any pair of adjacent STS markers. If we knew the total number of markers and the total genome size we could calculate the average spacing of markers. Since our BACs are around 200kb and each carries three to five markers, the above map implies the spacing of the markers is roughly 50kb. In a real mapping project the density of markers might be similar or perhaps twice as high (i.e. at least 50,000-100,000 STS markers for mapping the human genome, and probably at least 300,000 BACs [3-5 times the density shown here]).
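
The arithmetic behind these estimates can be laid out explicitly. The sketch below assumes a human genome of about 3,000,000 kb (3 Gb), a figure not given in the answer itself, and takes four STSs per 200 kb BAC as a round number from the map in (i).

```python
# Assumption (not stated in the answer): human genome of roughly 3 Gb.
GENOME_KB = 3_000_000
BAC_KB = 200            # approximate BAC insert size used in the answer
MARKERS_PER_BAC = 4     # BACs in the map of (i) carry 3-5 STSs

spacing_kb = BAC_KB / MARKERS_PER_BAC        # average distance between STSs
markers_needed = GENOME_KB / spacing_kb      # markers required at that spacing

print(spacing_kb, markers_needed)
```

Under these assumptions the spacing is 50 kb and about 60,000 markers are needed, which falls inside the range quoted above.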

(iii) The simplest approach would be to march down the chromosome, as in chromosome walking. So, to move to the left we could collect all BACs that have STS g and cluster them together as above, extending the assembly to the left. We could do the same for STS c on the right and continue to do this until we find a problem (a gap). There are many other possibilities but there must be some sort of logical progression (a plan), even if that is eventually converted to a computer program. Of course, we must have many arbitrary starting points (as above, by looking first at STS a) in order to seed multiple long assemblies (because there are several chromosomes and because we will not straight away be able to link all BACs together on any one chromosome).
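
One step of such a walk might be sketched as follows. The library is hypothetical: BAC 6 comes from (i), while BACs 9 and 10 and markers j and k are invented purely for illustration.

```python
def walk_step(end_marker, known_markers, library):
    """Return the BACs carrying `end_marker` and the new markers they add."""
    hits = {bac: sts for bac, sts in library.items() if end_marker in sts}
    new_markers = set().union(*hits.values()) - set(known_markers) if hits else set()
    return sorted(hits), sorted(new_markers)


# Hypothetical library: BAC 6 is from (i); BACs 9, 10 and markers j, k are invented.
library = {6: "abeg", 9: "gjk", 10: "bgj"}
print(walk_step("g", "gbeadfhc", library))
```

The new markers would then be ordered by clustering, exactly as in (i), and the walk repeated from the new outermost marker until no BAC extends the contig (a gap).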

Some answers confused this type of large contig building with putting DNA sequences together into a contig. The differences are really in scale and practicality. For example, it was suggested that BAC contigs could be extended by primer walking. Even if possible, that would be extremely slow (traveling less than 700bp at each step). More fundamentally, you need a suitable template for primer walking and cannot nominate one when you cannot identify a cloned piece of DNA covering the required region. The question is essentially about identifying that cloned DNA.

(iv)(a) There are many possible ways. One is informational. Gradually we build up catalogs of transposable element sequences and of other known repeated sequences and computers can compare any new sequence to that growing database in order to exclude known repeated sequences. Experimentally, hybridization to genomic DNA could give you the required information. A Southern blot is likely to be very clear (multiple bands with several enzyme digests or stronger bands if tandem repeats not separated by enzyme sites). FISH may show more than one site of hybridization (but not for clustered or tandem repeats). Each of the above is quite a lot of work, so you might consider alternatives. One could use the STS probe on a dot blot and take care to have controls allowing you to measure the strength of the hybridization signal. It is possible also to try a “reverse Southern” where you have a large number of PCR-amplified STSs on a filter and hybridize to labeled genomic DNA. Repeated sequences will give a notably strong signal.

You could just take STS primers and amplify from genomic DNA. If you see more than one band you do not have a unique sequence. However, a single band still allows the possibility that you are amplifying several identical sequences from different locations. That could be detected by quantitative PCR but that is a significant extra amount of work when testing many STSs. Each of the above approaches has some validity but the work required to be certain that an STS is unique means that is not always done (and is not a big problem if use of an STS can make clear that it is unsuitable).

Some answers suggested that we use BLAST to match an STS sequence to the genomic sequence. We cannot simply do that if the objective of the project is to map BACs of a genome of unknown sequence (if the sequence is known mapping BACs is trivial). As mentioned earlier, there can be a contribution of sequence comparison to a growing pool of known repeated sequences.

(b) If you used an STS that in fact contained repeated sequences (say, in a hybridization experiment using the STS probe) you would find that a very large number of BACs scored positive for that STS. Thus, simply looking at the total number of hits for an STS can alert you to a problem. If you then compared this number of hits to those of other markers found on the same set of BACs (picked out by that STS) you would quickly see if only the one STS is over-represented (in which case you should throw out that STS data) or if a unique genomic locus happened to be over-represented.
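
The hit-counting check can be sketched in a few lines. The scores below are invented, and the threshold (five times the median hit count) is an arbitrary assumption chosen only to make the idea concrete.

```python
from collections import Counter
from statistics import median


def flag_repeated_sts(bac_content, fold=5):
    """Return STSs whose hit count exceeds `fold` times the median hit count."""
    hits = Counter(m for sts in bac_content.values() for m in set(sts))
    typical = median(hits.values())
    return sorted(m for m, n in hits.items() if n > fold * typical)


# Invented scores: marker "r" is positive on every BAC, as a repeat would be.
library = {f"BAC{i}": {"r"} for i in range(1, 21)}
library["BAC1"] |= {"a", "b"}
library["BAC2"] |= {"b", "c"}
library["BAC3"] |= {"c", "d"}

print(flag_repeated_sts(library))
```

Here only marker r is flagged. In practice the flagged marker would then be compared with the other markers on the same BACs, as described above, before its data are discarded.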

If you were checking STS content by PCR the same idea applies but you might have a problem less frequently because an STS would be operationally unique by PCR tests if it simply contained at least one unique primer binding site, even if some of the amplified sequence was repetitive.

(v) Looking at only BACs 1-3 you can tell that b/e and d/f/h are on either side of a (from 2 & 3) and that d falls between b/e and f/h (1 vs 2 and 3), so you can draw the relative overlap of those three BACs exactly as shown in (i). You only lack information on the order of b relative to e and of f relative to h. Similarly, if you used fewer markers you would still be able to order quite a few BACs and generally all of the markers. In each case you would know more about the precise distribution of BAC end-points the more ordered markers there are, since precision is governed by the distance between those markers (a BAC end-point is defined by the presence of one marker and the absence of an adjacent marker). Thus, precision is favored by more markers and (in some cases) more BACs (because that ensures all of the markers can be ordered). The important points are (a) that STS mapping is possible with relatively few BACs and markers, and (b) that greater marker and BAC density will give better resolution of end-points (& of course, more end-points).
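
The remaining ambiguity with only three BACs can be made concrete by enumerating the marker orders that BACs 1-3 allow. This is a brute-force sketch with illustrative names, using only the STS content given in (i).

```python
from itertools import permutations

# STS content of BACs 1-3 only, from the list in (i).
BACS_1_TO_3 = {1: "abde", 2: "abe", 3: "adfh"}


def clustered(order, sts):
    """True if the markers in `sts` are adjacent to one another in `order`."""
    pos = [order.index(m) for m in sts]
    return max(pos) - min(pos) == len(pos) - 1


markers = sorted(set("".join(BACS_1_TO_3.values())))
orders = ["".join(p) for p in permutations(markers)
          if all(clustered(p, s) for s in BACS_1_TO_3.values())]
print(len(orders), orders)
```

Eight orders remain (four plus their mirror images). In every one a and d are adjacent in the middle, and the only freedom is swapping b with e or f with h.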

2. (i) It is obvious that BAC-1 and BAC-3 share almost all fragments, whereas very few fragments beyond the five used to identify this set of BACs are common to any other pair. It is therefore clear that BACs 1 and 3 overlap extensively. However, for the others you might wonder if they only overlap a little or if they share a set of 6-8 fragments just by coincidence (because size resolution is far from perfect). You could certainly dismiss the idea that all five BACs overlap by trying to merge a third BAC with any pair of merged BACs (they simply do not fit). If (other than 1 and 3) they do not all overlap then it is likely that no pair overlaps, since they all have similar numbers of shared fragments. So, the best guess might be that 1 and 3 is the only overlapping pair. You might be more certain of that conclusion if you did some sophisticated calculations or if you had direct experience of this kind of experiment. However, that is not critical here. What is important is to conclude that, apart from BACs 1 and 3, there is no convincing evidence of overlap, and hence that you should not try to merge BACs unless they have very extensive overlap of fragments.

It is critical for this whole question to appreciate the idea that two fragments of apparently the same size are not necessarily of exactly the same size and therefore not necessarily from the same region of the genome. Non-equivalence of that kind will be extremely common because the number of fragments examined is enormous (maybe about 50 x 200,000) and official resolution (50bp here) is nowhere near single bp resolution.

(ii) If the library is dense enough you would hope to find many pairs like BAC-1 and BAC-3, which overlap extensively. So, an appropriately conservative strategy would be to take BAC-1 and find the BACs with the closest matches in restriction enzyme fingerprint. BAC-3 might be the closest on one side and another BAC might be very close to BAC-1 on the other side (displaying one or two unique fragments not present in BAC-3). You could then (proceeding in each direction outwards) take this new BAC and, separately, BAC-3, find their closest matches and add them to the growing merge. In this way you can keep taking small steps outwards without ever having to question whether overlaps are by coincidence.
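
A minimal sketch of the "closest match" step follows. The fragment sizes are invented, the 50bp tolerance is the resolution quoted in (i), and the simple greedy pairing of fragments is an assumption of this sketch rather than the scoring used in real mapping projects.

```python
def shared_fragments(fp_a, fp_b, tol=50):
    """Count fragments of fp_a that pair with a distinct fragment of fp_b within `tol` bp."""
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        match = next((s for s in remaining if abs(s - size) <= tol), None)
        if match is not None:
            remaining.remove(match)   # each fragment may be matched only once
            shared += 1
    return shared


def closest_match(query, library):
    """Name of the BAC in `library` sharing the most fragments with `query`."""
    others = {name: fp for name, fp in library.items() if name != query}
    return max(others, key=lambda n: shared_fragments(library[query], others[n]))


# Invented fingerprints (fragment sizes in bp), for illustration only.
fingerprints = {
    "BAC-1": [1200, 2500, 3100, 4800, 7600, 9000],
    "BAC-3": [1210, 2480, 3100, 4800, 7600, 5400],
    "BAC-X": [1200, 2000, 3900, 6100, 8200, 9900],
}
print(closest_match("BAC-1", fingerprints))
```

Repeating `closest_match` from each newly added BAC gives the small outward steps described above. A real project would also require the best score to exceed a cutoff before accepting the merge.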

(iii) If you had fewer BACs you would not be able to make the easy decisions (about genuine overlap) described above. Instead, in most cases, even with the closest pair of BACs (in terms of the number of apparently common restriction fragments) you would be uncertain whether the BACs genuinely overlap. Exactly what the cutoff should be to determine if a match should be accepted is not trivial to determine. It will depend on the accuracy of measuring fragment lengths, the size of the genome and the number of BAC clones (and is determined on those bases for real mapping projects, where it is termed the Sulston cutoff). From the data here (which are not authentic) we argued in (i) that the five BACs cannot all overlap, yet roughly 25% of fragments were in common between pairs of BACs. From that you would say that 25% of similarly sized fragments is not good enough to infer an overlap. Hence, you would require greater overlap of BACs to be able to call an overlap with this method. Notice that this is very different to the analogous answer for STS mapping. STS mapping can work with quite a small library but fingerprinting requires very extensive overlaps, and hence a very big library.

(iv) (a) Fingerprint analysis is usually done with a 6-cutter enzyme such as EcoRI or HindIII. Such enzymes cut on average once every 4^6 bp (4,096 bp), or roughly every 4kb. Of course, you will occasionally get a 35kb fragment and a 25bp fragment but on average you would expect to see about 50 bands from each BAC and the majority of bands in the 1-10kb range. You would expect to be able to position almost all of the restriction sites (small fragments are ignored) on a map of merged BACs and hence any BAC end-point can be assigned to a specific restriction fragment. You may even be able to figure out the size of the terminal BAC fragments to see what proportion of the terminal HindIII restriction fragment is retained. Hence, the precision of positioning any one end (or two end-points relative to each other) will likely be about 1-10kb (and often on the smaller side). As discussed in Q1, STS markers may well be 20-50kb apart, giving an equivalent uncertainty in pinning down BAC end-points.
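
The band-count estimate is simple arithmetic, shown here as a short sketch. It assumes random sequence with equal base frequencies and the approximate 200kb BAC size used in Q1.

```python
SITE_LENGTH = 6       # EcoRI and HindIII recognise 6-bp sites
BAC_BP = 200_000      # approximate BAC insert size assumed in Q1

# With four equally likely bases, a given 6-bp site occurs about once per 4**6 bp.
mean_fragment_bp = 4 ** SITE_LENGTH
expected_bands = BAC_BP / mean_fragment_bp

print(mean_fragment_bp, round(expected_bands))
```

This gives a mean fragment of 4,096 bp and about 49 bands per BAC, consistent with the "about 50 bands" quoted above.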

To appreciate a difference in resolution some sort of quantitative estimate must be made and stated.

(b) If a deletion included an STS marker you would see that a BAC could not be mapped perfectly relative to others (it would have an internal STS missing), alerting you to a problem. However, most small deletions would fall between adjacent STSs and would not be apparent at all by STS mapping. If you happened to pick one such BAC clone as part of a minimal tiling path for sequencing you would end up with the wrong sequence for that portion of the genome. However, if you had a deletion of 1kb or more (fragments <500bp are ignored in fingerprinting) you would be almost certain to find that the BAC in question cannot be perfectly aligned with overlapping BACs (and you would not use it as a representative of the genome for sequencing).

Note that in Q1 & 2 you are being asked specific questions related to ordering of clones because that is the overall purpose. The primary purpose is not to draw a restriction map or to position ends of BACs precisely (although such efforts can be made if that turns out to be important for a few particular BACs).

3. (i) It is very important to generate a diversity of starting-points for sequencing. As an absolute minimum you would require starting points to be separated by no more than the length of a single sequence (say 700nt) in order to be sure you get full coverage of any strand. Sau3A sites separated by 2-3kb would make that impossible. In reality you require even more extensive staggering of start-points so that you can cover any region of sequence several times. For BAC alignments overlaps need only be several kb and so the spacing of Sau3A sites is adequate (to allow alignments and multiple overlapping BACs which are not identical).

(ii) In the first, bulk phase of a shotgun sequencing project you simply take each sequencing read as a separate piece of information. So long as it represents correct contiguous sequence it does not matter where the sequence read came from. In a 2 x 2kb composite clone a sequencing read (from each end) would not be long enough to cross the junction between the two inserts, so the entire sequence being read should be correct. Mate-pair information can be crucial to finish a sequence, but that information is never used for the majority of sequenced clones. A composite would indeed give misleading mate-pair information, perhaps leading you to test an incorrect scaffolding arrangement and to attempt to link sequences that cannot be linked. Note, however, that you would not actually link the two sequences incorrectly unless you performed primer walking on the composite clone, which would itself reveal that the clone is longer than expected.