Supplemental Material

Supplemental Materials and Methods S1

Sequencing of flow-sorted chromosomal DNA and assembly

The chromosome 3A double ditelosomic lines in the genetic background of wheat cultivar ‘Chinese Spring’ were used to isolate chromosome arm 3AS- and 3AL-specific DNA by flow-sorting (Kubaláková et al., 2002; Doležel et al., 2007). Due to the limited amount of material generated, the chromosomal DNA was amplified according to Šimková et al. (2008) using Illustra GenomiPhi V2 DNA Amplification kit (GE Healthcare). DNA amplified from 3AS and 3AL arms was used to construct two types of libraries for 454 sequencing using standard Roche protocols: 500 bp libraries for single read and 3 kb libraries for paired-end read sequencing. In addition to flow-sorted chromosomal DNA, single read sequencing was performed for a set of 2,743 BAC clones distributed among 1,104 BAC contigs assembled using the high information content fingerprint data obtained for 190,464 BAC clones from wheat chromosome 3A ( The minimum tiling path (MTP) was selected using a minimum e-value score of e−30 using the MTP function of the FPC (v9.3) program (Nelson and Soderlund, 2009). An overlap of 25-50 kb between adjacent BAC clones was used to define the MTP. Depending on the length of a contig, two to six BAC clones per contig were selected for sequencing. The 2,743 BAC clones were re-arrayed in 14 microtiter plates each with 14 rows and 14 columns. The BAC clones were individually grown and grouped in three-dimensional pools to develop Plate (14 pools), Row (14 pools), and Column (14 pools) pools. The BAC DNA was isolated from the 42 pools, each of which was barcoded during preparation of the genomic libraries for sequencing. All libraries were sequenced using 454 Titanium chemistry in the KSU Integrated Genomics Facility ( to generate 7.6 Gb of quality-trimmed sequence data, of which 26% was represented by paired-end sequence reads. 
The combined dataset generated for the flow-sorted chromosome arms and BAC clones corresponded to 9.3x coverage of wheat chromosome 3A (see Supplemental Table 1 online; NCBI accession number SRA045666). Assembly was performed using gsAssembler (v2.3, Roche) with default settings (40 bp minimum overlap; 90% minimum overlap identity), except that the option optimized for large and complex genomes was selected. This option invokes algorithms especially suited for wheat genome data, allowing for efficient and speedy contig assembly by skipping the assembly of highly repetitive regions of the genome and setting the “seed count” to 2.

Scaffolds and large contigs generated by gsAssembler were masked for repetitive elements by RepeatMasker (open-3.2.9) using a repetitive element library from the Triticeae Repeat Sequence Database (

Completeness of gene space coverage

A total of 77.4% of repetitive sequences in the assembled contigs and scaffolds were masked prior to coding sequence analysis (see Supplemental Table 4 online). The 11,836 chromosome 3A contigs had best BLASTN hits (e-value ≤ e-10, alignment length ≥100 bp) in at least one model grass genome (see Supplemental Table 5 online). The full-length cDNA (FL-cDNA) sequences mapped to wheat chromosome 3A were used to assess the gene sequence coverage. A total of 136 FL-cDNAs (trifldb.psc.riken.jp) showing similarity to single copy wheat ESTs mapped to wheat chromosome 3A were selected. To exclude the possibility of using erroneously mapped wheat ESTs, we further selected only those that had significant BLASTN hits (e-value ≤ 1e-10) with the syntenic genes on rice chromosome 1, resulting in a set of 97 FL-cDNAs. Of 97 FL-cDNAs, 57 (59%) had ≥80% of their sequence covered by the chromosome 3A contigs and scaffolds, and 79 FL-cDNAs (82%) had ≥50% of their sequence represented in chromosome 3A assemblies (see Supplemental Figure 3 online). This level of coverage is lower than the theoretically expected 99% target coverage based on 7x coverage of chromosome 3A. One possible explanation is unequal representation of various genomic regions in the amplified flow-sorted chromosomal DNA used to prepare libraries for sequencing. This possibility is consistent with findings suggesting that whole genome amplification can result in biased representation of genomic targets dependent on repeat content, chromosome size and GC-content (Pinard et al., 2006; Wicker et al., 2011).

Non-3A chromosomal contamination

Based on the estimates obtained using deletion-bin mapped ESTs, up to 13% of flow-sorted genomic DNA might be represented by contaminating chromosomal material (see Materials and Methods). There is a possibility that this DNA contamination may impact the results of downstream analyses due to the presence of duplicated copies of chromosome 3A genes on other chromosomes including homoeologous genes on chromosomes 3B and 3D. The higher level of divergence of these genes from orthologous genes in the model grass genomes may bias our estimates of the rates of gene evolution.

Based on our estimates, we may have up to 13% (or 7.65 Gb of total data × 0.13 ≈ 1 Gb) of contaminating DNA from other wheat chromosomes. If we assume that the size of the wheat genome without chromosome 3A is 15.1 Gb (16 Gb − 0.9 Gb), contaminating DNA provides 0.07x (1 Gb/15.1 Gb) coverage of the remaining fraction of the genome. Therefore, the 7.4x (6.65 Gb/0.9 Gb) coverage obtained for wheat chromosome 3A is more than 100-fold higher than the coverage obtained for other chromosomes in the 3A flow-sorted sample. This low level of coverage for other chromosomes will reduce: 1) the contribution of contaminating reads included in the chromosome 3A contigs to final consensus sequences, and 2) the fraction of contaminating gene sequences assembled into contigs longer than 500 bp (the threshold used in our study for including contigs in downstream analyses). Using the Lander-Waterman equation, we can calculate the proportion of, for example, a 5,000 bp-long gene fragment (the average size of a gene in plant genomes) that will be sequenced assuming that only 0.07x genome coverage is achieved. The expected unsequenced portion of a target is Ge^−c, where G is the size of the target sequence, e = 2.72 and c is the coverage (Lander and Waterman, 1988), so the expected sequenced portion is G(1 − e^−c). For c = 0.07 this gives approximately 338 bp (5,000 bp × (1 − e^−0.07)) of sequenced gene fragment, which is smaller than the length of a single 454 read (single reads were excluded from our analyses). In other words, with 0.07x sequence coverage the fraction of assembled contigs longer than a single 454 read length will be relatively small.
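The coverage arithmetic in this paragraph can be reproduced with a short script. This is a minimal sketch, assuming that under the Lander-Waterman model the expected sequenced portion of a target of size G at coverage c is G(1 − e^−c), so that G·e^−c is the expected unsequenced portion. All function and variable names are illustrative and are not part of the original analysis. The result is sensitive to how c is rounded: c = 0.07 gives about 338 bp, whereas c = 0.08 gives about 384 bp.

```python
import math

def expected_sequenced_bp(target_bp: float, coverage: float) -> float:
    """Expected number of bases of a target covered at a given coverage
    (Lander-Waterman: covered fraction = 1 - exp(-coverage))."""
    return target_bp * (1.0 - math.exp(-coverage))

# Values quoted in the text (Gb).
total_data_gb = 7.65
contamination_fraction = 0.13
genome_gb = 16.0
chr3a_gb = 0.9

contaminating_gb = total_data_gb * contamination_fraction            # about 1 Gb
non3a_coverage = contaminating_gb / (genome_gb - chr3a_gb)           # about 0.07x
chr3a_coverage = (total_data_gb - contaminating_gb) / chr3a_gb       # about 7.4x
fold_difference = chr3a_coverage / non3a_coverage                    # more than 100

# Expected sequenced portion of a 5,000 bp gene at two roundings of c.
sequenced_at_007 = expected_sequenced_bp(5000, 0.07)
sequenced_at_008 = expected_sequenced_bp(5000, 0.08)

# Expected fraction of chromosome 3A covered at 7x (the ~99% target).
chr3a_fraction_at_7x = 1.0 - math.exp(-7.0)

if __name__ == "__main__":
    print(f"non-3A coverage: {non3a_coverage:.3f}x")
    print(f"3A coverage: {chr3a_coverage:.2f}x ({fold_difference:.0f}-fold higher)")
    print(f"sequenced bp at c=0.07: {sequenced_at_007:.0f}")
    print(f"sequenced bp at c=0.08: {sequenced_at_008:.0f}")
```

Either rounding of c leaves the expected sequenced portion of a contaminating gene below the 500 bp contig threshold used in the study.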

If we assume that the total number of genes per diploid genome is 50,000 (Choulet et al., 2010), and that the inter-chromosomal gene duplication rate is 21% (Akhunov et al., 2003), then the total proportion of chromosome 3A gene homologs in the non-3A fraction of the genome is 14.7% [(8,400 genes × 2 genomes + 8,400 genes × 3 genomes × 0.21 duplication rate)/150,000 genes]. This suggests that among the chromosome 3A genes, 1.9% (0.13 × 0.147 × 100) can be represented by duplicated copies from other wheat chromosomes.
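The estimate in this paragraph is a two-step product, which the following sketch reproduces. It uses only the numbers quoted in the text (8,400 genes, a 21% duplication rate, 150,000 genes across three genomes, and 13% contamination); the variable names are illustrative.

```python
# Numbers quoted in the text.
genes_per_group3_chromosome = 8400
total_genes_hexaploid = 150_000      # 50,000 genes x 3 genomes
duplication_rate = 0.21
contamination_fraction = 0.13

# Homoeologous copies on 3B and 3D (two genomes).
homoeolog_copies = genes_per_group3_chromosome * 2
# Copies duplicated onto other chromosomes, as computed in the text.
duplicated_copies = genes_per_group3_chromosome * 3 * duplication_rate

homolog_fraction = (homoeolog_copies + duplicated_copies) / total_genes_hexaploid
contaminated_gene_fraction = contamination_fraction * homolog_fraction

if __name__ == "__main__":
    print(f"3A homologs in non-3A fraction: {homolog_fraction * 100:.1f}%")
    print(f"3A genes possibly represented by non-3A copies: "
          f"{contaminated_gene_fraction * 100:.1f}%")
```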

Additional evidence that non-3A chromosomal contamination has made a relatively small contribution to our assemblies comes from the alignment of chromosome 3A contigs with the chromosome 3A BAC clones sequenced using the Sanger approach. The estimated sequencing error rate based on this comparison was 0.07%. Inclusion of contaminating reads from other chromosomes (for example, 3B or 3D) should have increased the error rate significantly due to the divergence of the homoeologous sequences from each other.

The impact of flow-sorted chromosome contamination on evolutionary inferences should also be reduced by selecting syntenic genes sharing positional orthology among cereal genomes. However, this approach can potentially introduce bias into the estimation of the rates of gene evolution. Some studies have shown that positionally conserved genes are under stronger selective constraint and evolve more slowly (Notebaart et al., 2005), whereas orthologous genes duplicated by, for example, transposon-mediated duplication evolve faster (Han et al., 2009). Therefore, gene selection using positional orthology will most likely result in an underestimation of the evolutionary rates in the wheat genome, thereby producing more conservative estimates.

Effect of homopolymeric repeats on PTC origin

We identified premature termination codons (PTCs) in two sets of genes grouped according to the presence or absence of homopolymeric sequences (HPS; >5 bp) upstream of the PTC. If the PTCs resulted from overcalls/undercalls in HPS, we would expect to observe frequent co-occurrence of PTCs and HPS in the first dataset. The ratio of genes that had a PTC to those that did not was 170/487 in the first dataset and 135/293 in the second dataset (Fisher exact p-value = 0.05). These ratios suggest that HPSs did not have a significant impact on our estimates of PTCs, since the proportion of genes with PTCs is even smaller among genes that have upstream HPSs than among genes that do not.
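The comparison of the two datasets can be sketched as a Fisher's exact test on a 2×2 table. The code below is a minimal pure-Python illustration, assuming the counts are read as genes with a PTC versus genes without a PTC in each dataset (170 vs 487 and 135 vs 293) and assuming the common two-sided definition that sums all tables no more probable than the observed one. It is not the software used in the study, and the function names are hypothetical.

```python
from math import exp, lgamma

def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def log_prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    log_observed = log_prob(a)
    # Sum probabilities of all tables at least as extreme as the observed one.
    total = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= log_observed + 1e-7
    )
    return min(1.0, total)

def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c)."""
    return (a * d) / (b * c)

# Rows: genes with / without an upstream HPS; columns: PTC / no PTC.
p_value = fisher_exact_two_sided(170, 487, 135, 293)
ptc_odds_ratio = odds_ratio(170, 487, 135, 293)

if __name__ == "__main__":
    print(f"odds ratio = {ptc_odds_ratio:.2f}, two-sided p = {p_value:.3f}")
```

An odds ratio below 1 indicates that PTCs are less frequent, not more frequent, among genes with an upstream HPS.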

Inference of putative gene order on wheat chromosome 3A using synteny with model grass genomes

The 6,714 annotated coding sequences (CDS) from chromosome 1 of rice, 5,787 from Brachypodium chromosome 2 and 4,830 from sorghum chromosome 3 were aligned with each other reciprocally by BLASTN using an e-value threshold of e-10. The BLASTN hits were filtered using 60% CIP (cumulative identity percentage) and 70% CALP (cumulative alignment length percentage) thresholds as described by Salse et al. (2008). Genes with no obvious orthologs, as well as genes with multiple hits across the cereal genomes compared, were excluded. Orthologous sets of genes were classified as: 1) those that were syntenic in all three grass lineages; and 2) those that showed syntenic relationships between any two grass lineages. The positions of genes on rice, Brachypodium and sorghum chromosomes were used to order the putative wheat orthologs.
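The CIP/CALP filter can be illustrated with a small sketch. It assumes the usual reading of these statistics: CIP is the summed number of identical positions over all HSPs divided by the summed HSP alignment length, and CALP is the summed HSP alignment length divided by the query length, both expressed as percentages. The `HSP` class and function names are hypothetical, and the sketch ignores overlapping HSPs.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class HSP:
    """One high-scoring segment pair from a BLASTN hit (illustrative)."""
    identities: int   # number of identical positions
    align_len: int    # alignment length of this HSP

def cip_calp(hsps: Iterable[HSP], query_len: int) -> Tuple[float, float]:
    """Return (CIP, CALP) in percent for all HSPs of one query-subject pair."""
    hsps = list(hsps)
    aligned = sum(h.align_len for h in hsps)
    if aligned == 0 or query_len <= 0:
        return 0.0, 0.0
    cip = 100.0 * sum(h.identities for h in hsps) / aligned
    calp = 100.0 * aligned / query_len
    return cip, calp

def passes_filter(hsps: Iterable[HSP], query_len: int,
                  min_cip: float = 60.0, min_calp: float = 70.0) -> bool:
    """Apply the 60% CIP and 70% CALP thresholds quoted in the text."""
    cip, calp = cip_calp(hsps, query_len)
    return cip >= min_cip and calp >= min_calp

if __name__ == "__main__":
    hit = [HSP(identities=450, align_len=500), HSP(identities=270, align_len=300)]
    print(cip_calp(hit, query_len=1000), passes_filter(hit, query_len=1000))
```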

For detecting the centromeric region on wheat chromosome 3A, we compared sequences of contigs and scaffolds from the SGO map with raw sequence data generated from the flow-sorted 3AL and 3AS chromosomal arms using BLASTN, and selecting only the best hits with an e-value ≤ 1e-10 and a minimum alignment length of 100 bp. Only sequences uniquely mapped to one of the arms were considered.

Synteny with other grass genomes

We performed inter-genomic BLASTN comparisons of wheat chromosome 3A masked contigs and scaffolds with the annotated coding sequences of Brachypodium, rice and sorghum. As expected, the majority of BLASTN hits were distributed across the syntenic chromosomes of other cereal genomes (see Supplemental Table 6 online, Supplemental Figures 4, 5 and 6 online). Using the assembled contigs of chromosome 3A, we identified new regions in the rice genome that are collinear to wheat chromosome 3A (ranging from 20 to 80 genes) and are also duplicated on rice chromosomes 4, 8, 11 and 12 (see Supplemental Table 7 online). In addition, we confirmed previously reported ancestral duplications predating the divergence of cereal genomes, involving regions of model genomes that are syntenic to wheat chromosome 3A on rice chromosome 5 and sorghum chromosome 9 (see Supplemental Figures 5 and 6 online) (Salse et al., 2008; Paterson et al., 2009) and between rice chromosomes 1 and 10 (Yu et al., 2005).

Chromosomal segmental duplications

The chromosome 3A contigs were first compared against the gene models of the rice, Brachypodium and sorghum genomes (International Rice Genome Sequencing Project, 2005; Paterson et al., 2009; International Brachypodium Initiative, 2010). The genomes of these grass species were split into 1 Mb-size bins and the number of BLASTN hits (e-value ≤e-10; alignment length ≥ 100 bp) with wheat chromosome 3A contigs was recorded. A bin was considered to be enriched for sequences similar to wheat chromosome 3A when the number of genes with significant BLASTN hits exceeded the significance threshold (95th percentile). The significance threshold was established by performing 1,000 random permutations of random gene sets from all chromosomes and calculating the number of genes with significant BLASTN hits in the random sets of genes of the same sample size.
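The permutation threshold described above can be sketched as follows. This is an illustration only: it assumes each gene in the model genome is represented by a flag recording whether it had a significant BLASTN hit, draws random gene sets of the same size as the bin, and takes the 95th percentile (nearest-rank method) of the resulting hit counts. The names, the fixed random seed and the toy data are assumptions, not details from the study.

```python
import random
from typing import Sequence

def permutation_threshold(hit_flags: Sequence[bool], sample_size: int,
                          n_perm: int = 1000, percentile: int = 95,
                          seed: int = 0) -> int:
    """Hit count at the given percentile across random gene sets."""
    rng = random.Random(seed)
    counts = sorted(
        sum(rng.sample(list(hit_flags), sample_size)) for _ in range(n_perm)
    )
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = (percentile * n_perm + 99) // 100
    return counts[min(rank, n_perm) - 1]

def is_enriched(bin_hit_count: int, threshold: int) -> bool:
    """A bin is enriched when its hit count exceeds the threshold."""
    return bin_hit_count > threshold

# Toy genome: 1,000 genes, 10% of which have a significant hit.
toy_flags = [True] * 100 + [False] * 900
toy_threshold = permutation_threshold(toy_flags, sample_size=50)

if __name__ == "__main__":
    print(f"95th percentile threshold for 50-gene bins: {toy_threshold}")
    print(f"bin with 20 hits enriched: {is_enriched(20, toy_threshold)}")
```

Because the sample size matches the number of genes in the bin, a separate threshold would be computed for each distinct bin size.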

To identify segmental duplications involving the wheat homoeologous group 3 chromosomes, the chromosome 3A SGO map sequences were compared to 6,426 ESTs mapped to wheat deletion bins (Munkvold et al., 2004; Qi et al., 2004). The ESTs that were mapped to chromosome arms but not to a specific bin were removed from further analysis. Alignments with high-scoring segment pairs (HSP) of at least 100 nucleotides were used for analysis. The chromosomal region was considered to be duplicated if it included at least three duplicated EST loci. Circos software (v0.53) was used for graphical display of the duplicated regions (Krzywinski et al., 2009).

Comparison of bin-mapped ESTs from the wheat chromosome groups w1, w2, w4, w5, w6, and w7 identified 249 unique BLASTN hits on the chromosome 3A SGO map corresponding to 10 segmental duplications (Supplemental Figure 8 online). Five new duplicated regions were identified in addition to five already identified by Salse et al. (2008). There were three duplicated blocks on chromosome combinations w3-w1, two on w3-w2, three on w3-w5, and two on w3-w7 (Supplemental Table 9 online). One small duplicated region (0.4 Mb) was identified on the distal short arm of w3A-w1, in addition to two already known duplication blocks on w1 and w3 (Salse et al., 2008). These three duplicated regions on w3A-w1 were detected using 42 paralogous genes and represent 47% of chromosome 3 (based on rice chromosome 1 sequence). Two of the three duplicated regions between w3A and w1 are shared between rice chromosomes 1 and 5 and are conserved in the orthologous positions between wheat and rice. Two duplicated blocks identified in a w3A-w2 comparison involved two regions on the long arm of 3A duplicated on the short and long arms of w2 (Supplemental Table 9, Supplemental Figure 8 online). There were three duplicated regions identified in a w3A-w5 comparison, all involving the long arm of w5. Two duplicated regions were identified in a w3A-w7 comparison, one involving w3AS-w7L and the other w3AL-w7S (Supplemental Figure 8 online). One of the duplicated regions on w3AS-w5L identified by Salse et al. (2008) was also identified in the present study. However, another duplicated region, w3AL-w5S, identified by Salse et al. (2008), could not be detected. Instead, two duplicated regions involving w3AL-w5L were identified (Supplemental Figure 8 online). Thus, we identified 10 duplicated regions involving 4 wheat homoeologous groups of chromosomes. 
These duplicated regions were present in all three homoeologous groups of chromosomes suggesting that duplications originated before radiation of the diploid ancestors of wheat.

Functional annotation of wheat chromosome 3A syntenic genes

The sequences of chromosome 3A genes identified by the PASA pipeline and by comparison with gene models predicted in the rice, Brachypodium and sorghum genomes were compared against the NCBI nr database using the BLASTX program with an e-value threshold of e−6 and HSP length cutoff of 33. Annotation was performed using the Blast2GO suite (Conesa and Götz, 2008) with default parameters (e-value hit filter of e−6, annotation cut-off of 55, GO weight of 5). Classification graphing was carried out with a sequence filter of 50 and a node score filter of 5. Graphs were developed using level two for Biological Processes and level three for Cellular Component and Molecular Function (see Supplemental Figure 10). The chromosome 3A gene annotations were compared with those of the 55,052 transcripts included in the Affymetrix GeneChip Wheat Genome Array (representative of the entire wheat genome) by performing enrichment analysis with the FatiGo (Al-Shahrour et al., 2004) and Gossip (Blüthgen et al., 2005) packages that employ Fisher's exact test to estimate the significance of associations between reference and test sets, while correcting for multiple testing using FDR (false discovery rate), FWER (family-wise error rate) and single test p-value. GO terms that were under- or over-represented between the compared gene sets were identified using the significance threshold of 0.01 and 0.0001.

Of the 3,646 genes from the SGO map searched against the NCBI non-redundant database, only 22 sequences (0.6%) returned no hits. Nearly 82% (2,988) of the 3,646 genes showed similarity to proteins with defined function, 541 genes (15%) hit unknown proteins and 68 genes (1.9%) hit a hypothetical protein. Functional classification was obtained for 2,438 genes (67%). The wheat chromosome 3A SGOs were classified according to biological processes, molecular functions, and cellular components. Within the “biological process” classification, 26.4% of genes belonged to the “metabolic process” category, 26.2% of the genes belonged to the “cellular process” category, followed by 9% of sequences belonging to the “response to stimulus” category (Supplemental Figure 10A online). In the “cellular component” category (Supplemental Figure 10B online), 42% of sequences were classified as “cell part” and 33% of sequences were in the “membrane-bounded organelle” category. In the “molecular functions” category (Supplemental Figure 10C online), 16% of sequences were classified into each of the “nucleotide binding”, “hydrolase activity”, and “transferase activity” classes.

Divergence between wheat genomes

Based on previously published work (Dvorak et al., 2006), which reported that the divergence of genic sequences between wheat genomes can vary from 2% to 4%, we selected a 98.5% similarity threshold for separating homoeologous genes. However, variation in the rates of mutation and selective pressure across genomes may result in variation in the inter-genomic similarity levels among genes in the wheat genome. To validate the accuracy of the selected (98.5%) similarity threshold for separating homoeologous genes from each other, we investigated the distribution of similarity levels between duplicated genes in the A-, B- and D-genomes of wheat. For this purpose we used previously published sequence data generated for 2,114 gene fragments in 32 accessions of tetraploid and hexaploid wheat (Akhunov et al., 2010; In that study the homoeologous copies of genes from the A, B and D genomes were PCR-amplified using genome-specific primers followed by validation of genome assignments using nullisomic-tetrasomic aneuploid wheat lines. Exonic sequences were aligned using MUSCLE (Edgar, 2004) and the proportion of divergent sites in A-B, B-D and A-D genome comparisons was estimated using the libsequence library (Thornton, 2003). The average similarity level between homoeologous copies of genes was 94% (Supplemental Figure S11). The proportion of genes that showed similarity levels higher than 98.5% between the wheat genomes was 5.7%. This result suggests that by using a similarity threshold of 98.5% or higher, 94.3% of transcript assemblies can be accurately selected for homoeolog-specific gene prediction.
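The similarity measure used in this validation can be illustrated with a short sketch. It assumes that similarity is 100 × (1 − p), where p is the proportion of divergent sites between two aligned sequences, and that alignment columns containing a gap or an ambiguous base (N) in either sequence are skipped. It does not reproduce the libsequence API, and the function names are hypothetical.

```python
from typing import Sequence

def proportion_divergent(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    ignoring columns with a gap ('-') or 'N' in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

def percent_similarity(seq_a: str, seq_b: str) -> float:
    """Similarity in percent between two aligned sequences."""
    return 100.0 * (1.0 - proportion_divergent(seq_a, seq_b))

def fraction_above(similarities: Sequence[float], threshold: float = 98.5) -> float:
    """Fraction of gene pairs whose similarity exceeds the threshold."""
    return sum(s > threshold for s in similarities) / len(similarities)

if __name__ == "__main__":
    print(percent_similarity("ATGCATGCAT", "ATGCATGCAA"))
    print(fraction_above([99.0, 94.0, 92.5, 97.0]))
```

Gene pairs above the threshold are the ones whose homoeologous copies would not be separated at 98.5% similarity.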

Validation of acquired exonic sequences by RT-PCR

Total RNA was isolated from several tissues of hexaploid wheat (leaves and roots of 2-week seedlings, and pre-anthesis spikes) using the RNeasy kit (Qiagen). RT-PCR reactions were performed according to the manufacturer's protocol using SuperScript® III and Taq DNA polymerase (Invitrogen). The template for the RT-PCR reaction was prepared using a poly(T) oligonucleotide primer. For validation, one of the exon-specific oligonucleotide primers was designed to a sequence within an acquired exon. If the acquired exonic sequence was present in the cDNA, we expected to obtain a PCR product of a defined length. Twenty-six pairs of primers (72%) out of 36 produced PCR products of the expected size, suggesting that the majority of our bioinformatic predictions were correct. The remaining primer pairs either failed to produce a PCR product (8 primer pairs) or failed to produce a PCR product of the expected size (2 primer pairs). The failure of these primers can result from 1) errors in gene structure prediction; 2) absence of a particular transcript isoform in the isolated RNA sample; or 3) a low level of isoform expression in the isolated RNA sample. The results of experimental validation are provided in Supplemental Table S16.