Supplementary Material for
A high-resolution multi-strain haplotype analysis of laboratory mouse genome reveals three distinctive genetic variation patterns
Jinghui Zhang, Kent W. Hunter, Michael Gandolph, William L. Rowe, Jenny M. Kelley, Michael Edmonson and Kenneth H. Buetow
Laboratory of Population Genetics, National Cancer Institute, National Institutes of Health, 8424 Helgerman Court, Room 101, MSC 8302, Bethesda, Maryland 20892-8302, U. S. A.
Mapping SNPs to the mouse draft sequence
We have mapped SNPs from Celera, Whitehead Institute (WI) and Genomics Institute of the Novartis Research Foundation (GNF) to the February 2003 release of mouse genomic sequence (NCBI Build 30, MGSCv3, chr16.fa) downloaded from UCSC Genome Bioinformatics web site (http://genome.ucsc.edu/cgi-bin/hgGateway?org=Mouse&db=0&hgsid=26315896). As described in Wiltshire et al (Wiltshire et al. 2003), SNPs in the Celera data set were discovered from Celera Assembly R13 in which 129S1/SvImJ was sequenced to 0.3-fold coverage.
The Celera SNPs were mapped in two steps. First we mapped SNPs in non-repetitive regions using the program powerblast (Zhang and Madden 1997). The 99Mb chromosome 16 genomic sequence was split into 1Mb segments with 100kb overlap between each segment, and each 1Mb segment was searched against a database file with flanking sequences of all Celera SNPs. Repetitive sequences were masked during BLAST search (expect value =1e-10) but not in the computation of gapped alignment (for which the SIM4 (Florea et al. 1998) algorithm was used). 54,001 SNPs with a maximum of 2 mismatches or small gaps (<1kb) were considered good hits. We then mapped the remaining SNPs in repetitive regions using the blastall (Altschul et al. 1997) program. The SNP sequence was used as query to search against the genomic sequence file with the expect value set to 1e-10. Neither the repetitive sequences nor low complexity regions were masked. BLAST hits with >=100bp aligned residues and <=2 mismatches or small gaps (<=10bp) were retained. Blast hits that mapped to multiple genomic locations with the same sequence identity score were filtered because they were likely to represent variations in paralogous duplications instead of genetic polymorphism. SNPs with inconsistent mapping order on the Celera and MGSCv3 assembly were identified and excluded from haplotype analysis.
Both the WI and GNF SNPs were mapped to chromosome 16 using the method developed for mapping Celera SNPs in repetitive regions. All 2,347 WI SNPs were mapped successfully to chromosome 16 genomic sequences. However, substantial data inconsistency was observed in the GNF data set.
A total of 142 loci on the GNF data file, mpd133loc.zip, were assigned to chromosome 16 by their values either on column 3 (chromosome) or on column 6 (p_chr). We found inconsistent chromosome assignment on the two columns for 15 loci (46.MMHAP31FLH4, AF034092, AI256693, AI747533, D32072, AA410145, U19380, AI853701, AI591627, AI647066, AA682079, AI255180, AI325074, C76608, AA516743, AL033325, U27830, AW048328, AI255777, AL022808, 14.MMHAP61FLE2, D16333, C78391, AI505953, C76690). We decided to analyze all loci because there is no documentation to account for the inconsistency. 9 loci (AI256693, AI505953, AI747533, AL022808, AU018521, AV266914, AW456757, C76608, Y09865) were excluded due to lack of corresponding genotype data in the mpd133strains file. Of the remaining 133 loci, 18 (including a total of 85 SNPs) map either to a different chromosome or to multiple genomic loci with same sequence identity score (Table s1). These loci were excluded from further analysis. In addition, one SNP (at position 256 of locus C80713) was excluded because its genotype data shows that the marker is monomorphic.
SNP classification in gene coding regions
We observed that current SNP annotation in mouse RefSeq data may not include markers with short flanking sequences in the exons. For example, only 4 out of 6 SNPs in the coding region of Tnk2 gene are annotated on RefSeq NM_016788.1. The remaining two cSNPs (Celera accession number mCV23502520 and mCV23502501) are not annotated because they are located within 20-bp of the intron-exon boundary. We therefore decided to recompute SNP classification in gene coding regions. Using the powblast (Zhang and Madden 1997) program, we mapped a total of 455 RefSeq records (obtained from NCBI (ftp://ftp.ncbi.nih.gov) on March 13) to mouse chromosome 16. We compared our results with the RefSeq annotation on the most recent GoldenPath mouse map (refseq_3_03.txt). 5 RefSeq records annotated on UCSC map were missing in our results, four of which (NM_008735, NM_027515, NM_027632, NM_027894) are obsolete in the current RefSeq database while the remaining one (NM_021713) is a chimeric sequence of which 80% of the mRNA sequence maps to chromosome 15. An additional 22 RefSeq mRNAs were found only in our results; all were new records released in March 2003. Using the alignments between the Refseq sequences and the chromosome 16 genomic sequence, we mapped the SNPs to the RefSeq sequences and computed SNP classification of protein coding changes. The results are summarized in Table s2. The GNF data has highest fraction of coding SNP (18.4%) and in 82% of all GNF SNPs the 7 laboratory inbred strains have the same allele while one of the two wild-derived inbred strains (CAST/Ei and SPRET/Ei) has a different allele.
Consistency between the haplotype maps and genotype data in independent studies
A program was developed to compare haplotype block structure in one study (referred to as study A below) with SNP genotypes in another (referred to as study B). The SNP genotype data in study B was converted into the same integer representation as the haplotype block structure in study A which uses the integer 1 or 2 to represent alleles identical to or different from the allele of the C57BL/6J strain respectively. For each SNP in study B, the program first tries to identify a haplotype block in study A that overlaps with the SNP genomic location. If such a haplotype block does not exist, the program then finds two flanking haplotype blocks in study A as the matching haplotype blocks for the SNP in study B. If the integer values of the common strains are identical between the haplotype block and the SNP genotype, then the haplotype block is considered consistent with the SNP genotype data. For SNPs that were mapped indirectly to two flanking haplotype blocks, the program requires only one of the flanking haplotype blocks to match the SNP genotype consistently. The results are presented in Table s3.
WI haplotype blocks of C57BL/6J and 129S1/SvImJ strains were extracted from the following file http://www.broad.mit.edu/personal/claire/strainsnplist_all.xls, which contains the start position of each shotgun sequence read (in the coordinates of February 2002 mouse assembly) and its p value to be classified as a SNP-rich (jungle) region. The genomic coordinates were converted to that of the mouse assembly of February 2003 by the SNP positions in these two assemblies. Regions with p value <5% are considered SNP-poor (non-jungle) while the remaining are considered SNP-rich (98.8% of which have p value >=95%). A total of 2,643 shotgun reads from 129S1/SvImJ were listed in this file. Assuming an average of 800bp length per shotgun read, the total sequence coverage is 2,114,400bp, approximately 2% of the entire chromosome 16.
We followed the description in Wade et al to define regions with consistently low or consistently high SNP rate. Neighboring reads with the same classification (SNP-rich or SNP-poor) were merged into one genomic interval. Neighboring reads with different classification were not merged, and the sequence in between is considered a region with no data coverage. In all, SNP-poor blocks, SNP-rich blocks and regions of no coverage comprise 24%, 68% and 8% of chromosome 16 respectively. The blocks generated by this process are almost identical to the GIF image of 129S1/SvImJ haplotype map at the WI web site (http://www-genome.wi.mit.edu/personal/claire/chromosome.html).
We downloaded the Microsoft Excel file SNPviewC.pl from the GNF SNP web site (http://www.gnf.org/cgi-bin/SNPview) to obtain the GNF haplotype map.
Selection of target regions for resquencing
Table s4 summarizes the information about the 21 genomic loci selected for resequencing. The details are described below.
Using the WI SNP data from all four strains, we identified a total of 17 large (>1Mb) SNP-free genomic regions, 7 of which are consistent with the SNP-poor regions (SNP “deserts”) identified in our analysis of Celera data. For the remaining 10 inconsistent WI SNP “deserts”, we selected 1 target region per WI “desert” for resequencing using the following criteria: a) the target region is located in the middle of a WI “desert” (with a minimum of 100 kb distance from either end of a WI SNP “desert”) to avoid selecting a patch of SNP-rich region at the border of a WI SNP “desert”; b) the region should have high SNP density with multiple SNPs detectable in one amplicon; c) regions with genetic variations between 129S1/SvImJ and C57BL/6J, the two strains that were included in both the WI and Celera studies, are preferred; d) regions that form large haplotype blocks are preferred; e) regions that can support additional analysis (eg, “orphan” SNPs, segmented haplotype blocks) are preferred.
We selected two large genomic regions for resequencing to investigate the SNP “deserts” discovered in this study, to validate high variability of SNP density in large haplotype blocks, and to investigate relationship between gene structure and SNP density. The first region is a 15kb genomic segment located at 82,997,729-83,012,888bp of chromosome 16. It is part of a 1.74Mb SNP desert located at 81.40Mb-83.14Mb. In addition, this segment also overlaps with the longest haplotype block (located at 82,865,855-84,403,376bp) discovered in our study. This 1.5Mb haplotype block include a total of 227 SNPs, one of which is located in the 15kb resequencing region in the SNP desert. We also selected 3 additional genomic segments, representing regions with high (28 SNPs per 10kb) and low (1 SNP per 10kb) SNP density in this large haplotype block. The second genomic region selected for resequencing is the gene-coding region for RefSeq NM_145481 located at 35,621,795-35,633,606bp. The region is part of a 44-SNP haplotype block located at 35,610,363-35,687,737 bp. The 1.8kb 3’ UTR of the gene contain 6 Celera SNPs while the remaining 6.9kb gene coding regions, including coding exons and introns, contain only 2 SNPs. We selected five genomic segments to resequence a total of 5,172bp genomic sequence in exons and intronic sequences flanking the exons.
To investigate genetic variations between the two 129 strains, 129S1/SvImJ and 129X1/SvJ, we selected one haplotype block located at 39,554,019-39,660,183bp with a total of 104 SNPs between 129S1/SvImJ and 129X1/SvJ strains. We resequenced 3 segments with high SNP density (3-7 SNPs per amplicon) located in the middle and at the two ends of this haplotype block.
We selected 4 regions where the genotypes in the WI study are inconsistent with our haplotype block structure, two of which are small haplotype blocks (680bp and 9,686bp) while the others are large (535,289bp and 241,157bp). We selected 3 regions where GNF genotypes are inconsistent with our haplotype block structure (the haplotype block sizes for the three regions are 19,623bp, 76,432bp and 125,547bp respectively).
Resequencing Results
a) SNP validation rate for Celera, WI and GNF data
There were a total of 132 Celera SNPs, 22 WI SNP and 8 GNF SNPs in the regions selected for resequencing. 1 Celera SNP (mCV24912451) and 1 WI SNP (WGS_16_7677994) were found to be false positive substitution variations within simple tandem repeats. 3 GNF SNPs (AI323795 at position 99, AI507462 at position 2 and C81315 at position 71) were found to be non-polymorphic among all strains including the 12 wild derived inbred strains. The SNP validation rate was 99%, 95%, and 63% for the Celera, the WI and the GNF data sets in these regions respectively.
b) Regions identified as SNP-poor across all strains in the WI study
In the Celera data set, regions corresponding to 10 SNP-poor regions across all strains in the WI study (ranging from 1.0Mb to 1.7Mb in their genomic span and the total genomic span is 13.8Mb) have an average density ranging from 2 to 9 SNPs per 10kb in the Celera data set (Table s5). In all these regions include 11% of the total SNPs in the Celera data set. We selected one target region per WI SNP-poor region for resequencing, and the results showed that in the target regions the average density for SNPs derived from the five strains in the Celera data, the four strains in WI study and all laboratory inbred strains are 4.7, 4.3 and 5.2 per kb respectively (Table s5), confirming our initial observation that these regions are genetically heterogeneous among laboratory inbred strains.
c) Regions identified as SNP-poor across all strains in this study
In the 15kb target region found to be SNP-poor through our analysis, only 3 SNPs (one of which was in the Celera data set) were discovered among the laboratory inbred strains in the resequencing. When wild-derived inbred strains were included, a total of 377 SNPs were discovered in this region, suggesting that inbreeding alone would not result in SNP poverty (Table s7).
d) Regions where the haplotype map constructed in this study fail to predict the genotypes in the WI and GNF studies
The 3 invalid GNF SNPs accounted for the three discrepancies between the GNF genotypes and the haplotype structure constructed in this study. Of the four discrepancies between the WI genotypes and our haplotype blocks, one could be attributed to the invalid WI SNP, two were the “orphan” SNPs skipped during haplotype block construction and one was located in a highly fragmented “erosion” haplotype block.