The internal transcribed spacer as a universal DNA barcode marker for fungi
Fungal Barcoding Consortium
Introduction
The absence of an accepted DNA barcode for Fungi is a serious limitation for the inclusion of this kingdom, the second most speciose in the Eukaryotes, in this rapidly developing field. Fungi play many critical roles in planetary ecosystems and the human economy; they include symbionts, sources of food and industrial chemicals, agents of human, plant and animal disease, toxin and antibiotic producers, and vital participants in nutrient cycling. DNA barcoding is the use of standardized DNA sequences to identify species from all kingdoms of eukaryotic life. Barcodes are 500-700 bp sequences obtained from a single polymerase chain reaction (PCR) amplification, flanked by conserved regions that allow the use of universal primers. Reference barcode sequences must have clearly documented voucher specimens, be identified by acknowledged taxonomic experts, and be accompanied by sequence chromatograms that validate the sequence. For a barcode to be effective, interspecific variation should exceed infraspecific variation, and identification is most straightforward when a sequence is constant and unique to a single species (Hebert et al. 2003a, b, Letourneau et al. 2010).
Ideally, the DNA barcode region used would be a single locus for all kingdoms. The gene encoding the mitochondrial cytochrome c oxidase 1 (CO1, or cox1) was proposed as the DNA barcode region in animals (Hebert et al. 2003a, b) and then adopted by the Consortium for the Barcode of Life (CBOL) as the default barcode for all groups (Schindel and Miller 2005). In plants, however, CO1 had limited use for differentiating species across a wide range of taxa, and a combination of rbcL + matK was adopted as a two-locus barcode system (CBOL Plant Working Group 2009; Kress et al. 2009). This sets a precedent for reconsidering the default fungal barcode locus.
The performance of CO1 as a DNA barcode for fungi has been inconsistent (Seifert 2008, 2009). Although it functions well in some groups, such as Penicillium (Seifert et al. 2007), where reliable primers exist and species resolution is adequate (67% in a speciose, presumably young lineage), results were disappointing in the few other fungal groups examined experimentally. Amplification of CO1 is sporadic because of extreme length variation resulting from the unpredictable presence of multiple introns (Rossman 2007, Seifert 2007, 2008, 2009); multiple copies of different lengths and variable sequence exist in some groups (Gilmore et al. 2010). Although there are degenerate primers that work with many Ascomycota (Gilmore et al. 2010), their performance is difficult to assess because of amplification failures probably related to length variation. Because most fungi spend most of their life cycle in a hidden, microscopic form, the only way they can be reliably monitored in environmental surveys is by using robust and universal primers, something that appears impossible with CO1.
The nuclear ribosomal RNA genes have two decades of use in fungal diagnostics and phylogenetics (Begerow et al. 2010) and are the most frequently discussed alternatives to CO1 (Rossman 2007, Seifert and Crous 2008, Eberhardt 2010, Schoch et al. 2011). The small subunit (SSU) is commonly used in phylogenetic studies, and although it is often used as a species level marker in bacteria (Stackebrandt and Goebel 2004), it has few ‘hypervariable domains’ and is seldom used for species identification in fungi. The large subunit (LSU) region sometimes discriminates species well, either on its own or in combination with the internal transcribed spacer (ITS). For the ascomycetous yeasts, the D1/D2 region of LSU was used for characterizing species long before the concept of DNA barcoding was promoted (Kurtzman and Robnett 1998, Suh et al. 2006); the comprehensive database allows most yeasts to be identified rapidly near the species level.
The ITS is the most frequently sequenced genetic marker for fungi (Table 1 in Begerow et al. 2010), is widely used for species identification in many fungal lineages, and already functions as a de facto barcode (Begerow et al. 2010, Kõljalg et al. 2005, Eberhardt 2010, Seifert 2009).
Currently, there are ~172,000 reasonably full-length fungal ITS sequences in GenBank, 56% identified with a Latin binomial, corresponding to ~15,500 species and 2,500 genera of fungi, generated from >150 countries on all continents and derived from ~11,500 scientific studies from ~500 different scientific journals (H. Nilsson, unpubl.).
In some groups of fungi, there are limits to the barcoding utility of ITS. Low interspecific variability, especially in several groups within the filamentous ascomycetes (Pezizomycotina), makes other markers, such as elongation factor 1-α, a more popular choice (Gazis et al. 2011). These include species-rich ascomycetous hyphomycete genera with shorter amplicons, such as Cladosporium (Schubert et al. 2007), Penicillium (Skouboe et al. 1999) and Fusarium (O’Donnell & Cigelnik 1997). In addition, it is well known that several species often share a common ITS sequence. In contrast, ITS paralogs with different sequences could also interfere with barcoding success. Multiple divergent copies of ITS have been reported in some species, including the ascomycete Fusarium (O’Donnell & Cigelnik 1997) and the basidiomycete Laetiporus (Lindner & Banik 2011). In other cases high infraspecific variability makes amplification close to impossible (often but not exclusively in the Basidiomycota, REFs). SOMETHING ABOUT HYPERVARIABLE ITS HERE?
Protein-coding genes are now widely used in mycology, both for higher-level phylogenetic studies and for species level diagnostics. There has been little standardization, which limits the usefulness of these markers as universal fungal barcodes. Specialized identification databases for different genera often employ different markers, e.g. translation elongation factor 1-α for Fusarium (O’Donnell et al. 2010) and β-tubulin for Penicillium (Frisvad & Samson 2004). Among other protein-coding genes, the largest subunit of RNA polymerase II (RPB1) has not yet been explored as a potential fungal barcode, but has promise because it is ubiquitous in eukaryotes, has a slow rate of sequence divergence, and is typically present as a single copy within a genome (Tanabe et al. 2003). RPB1 has shown utility in some phylogenetic studies in the Zygomycota and Microsporidia (Cheney et al. 2001, Tanabe et al. 2002, Tanabe et al. 2004, Liu et al. 2006) and in protists such as Foraminifera (Longet and Pawlowski 2007). In Ascomycota, protein-coding genes (and RPB1 in particular) have been shown to have a superior phylogenetic informativeness profile compared to ribosomal genes (Schoch et al. 2009). Primers have been developed for several groups as part of the Assembling the Fungal Tree of Life (AFToL) project (aftol.org). RPB1 is also one of the xxx loci chosen in the second phase of the AFToL project (REF?).
This paper presents the results of a multi-laboratory, multinational initiative with the goal of establishing the standard DNA barcode for Fungi. We compared the barcoding utility, based on probability of correct identification (PCI), of the three ribosomal regions (ITS, LSU and SSU) and one representative protein-coding gene, RPB1, based on newly generated sequences from 951 specimens or cultures representing *** species, covering *** of the 17 major fungal lineages (Fig. 1). All fungi were authoritatively identified by taxonomic experts. Contributors used the standard primers for these markers adopted by the Assembling the Fungal Tree of Life project (aftol.org) and submitted sequences to a custom-built database for analysis ( Some contributors also submitted sequences of two additional optional genes of potential interest as supplementary barcodes. The second largest subunit of RNA polymerase II (RPB2) was explored by the AFToL project and, as for RPB1, primers exist that amplify several lineages in Fungi (James et al. 2006). The second gene encodes a mini-chromosome maintenance protein (MCM7) that is essential for the initiation of eukaryotic genome replication, and was chosen based on work by Aguileta et al. (2008) and Schmitt et al. (2009).
Materials and Methods
Database
We used a Biolomics database (Robert and Szoke 2011) to set up more than 2600 individual samples covering the major lineages of Fungi provided by X contributors. A graph representing the variability of the data contributed is given in Appendix 6. Figure 1 represents the main lineages of Fungi, with those analysed in this study indicated in red. (We did receive data for Glomeromycota and Neocallimastigomycota but they were not used in the comparative analyses).
More details from Vincent
Analyses
Data sets were set up as follows. Closely related asexual and sexual names of species were coded under the same genus name. Data sets were split in order to allow for a more taxonomically targeted assessment of markers. The taxonomic labels used are explained in a dendrogram in Figure 1.
After removal of problematic sequences, the first set of analyses had 416 isolates from Pezizomycotina, 89 from Saccharomycotina, 202 from Basidiomycota and 43 from the combined basal lineages for four markers: ITS, LSU, RPB1 and SSU. A second set of analyses compared three markers on two data sets: ITS, LSU and RPB1 on 683 isolates of Pezizomycotina, and ITS, LSU and SSU on 152 isolates of the basal lineages. Finally, a third run compared six markers for a selection of Pezizomycotina, Basidiomycota and Saccharomycotina: the four markers of the first comparison plus the two optional markers, MCM7 and RPB2.
Probability of Correct Identification (PCI). Given a specific dataset, the computer calculated a probability of correct species identification (PCI) as follows. First, two types of sequence alignment between every pair of samples in the dataset were calculated. The alignment types were: (a) global alignment (the Needleman-Wunsch algorithm aligns the entire sequence length, with penalties for gaps at the alignment ends (Needleman & Wunsch, 1970)); and (b) semi-global alignment (using a variant Needleman-Wunsch algorithm that includes both ends of one sequence, finding the alignment with the highest score without penalizing end gaps in the other sequence; the algorithm also does the same for the other sequence, returning the alignment with the higher of the two highest scores). Thus, (a) global alignment matches the whole length of two sequences; and (b) semi-global alignment matches one sequence to a subsequence of the other, and then vice versa. Semi-global alignment provides a check on whether disparate sequence lengths degrade species identification: if they do not, global and semi-global alignment should give similar species identifications. All alignments in our study used the BLAST default DNA scoring system (Altschul, 1997; Altschul, 1999). The results (shown for semi-global alignment with BLAST scoring) were robust against the choice of alignment and the choice of alignment scoring: other choices using global alignment or the BLASTZ default DNA scoring system (Schwartz et al., 2003) produced similar results (not shown).
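The difference between the two alignment types can be illustrated with a minimal Python sketch. It is not the software used in this study: the function names are hypothetical, the scoring values (match +1, mismatch -1, linear gap -2) are placeholders rather than the BLAST defaults, and only the alignment score is computed, not the alignment itself.

```python
def _align_score(a, b, free_ends_in_b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for aligning all of `a` against `b`.

    If `free_ends_in_b` is True, leading and trailing overhangs of `b`
    are not penalized, so `a` is matched to the best subsequence of `b`.
    Scoring values are illustrative placeholders (linear gap cost).
    """
    n_cols = len(b) + 1
    # Row 0: skipping a prefix of b is free only in the semi-global case.
    prev = [0 if free_ends_in_b else j * gap for j in range(n_cols)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # characters of `a` aligned to gaps always cost
        for j in range(1, n_cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,  # (mis)match
                           prev[j] + gap,      # gap in b
                           cur[j - 1] + gap))  # gap in a
        prev = cur
    # Semi-global: the alignment may stop anywhere along b at no cost.
    return max(prev) if free_ends_in_b else prev[-1]


def global_score(a, b):
    """(a) Global alignment: both sequences aligned end to end."""
    return _align_score(a, b, free_ends_in_b=False)


def semi_global_score(a, b):
    """(b) Semi-global: fit a into b, then b into a; keep the higher score."""
    return max(_align_score(a, b, free_ends_in_b=True),
               _align_score(b, a, free_ends_in_b=True))


if __name__ == "__main__":
    short, long_ = "ACGT", "TTACGTTT"
    print(global_score(short, long_), semi_global_score(short, long_))
```

In this toy case the short sequence is contained in the longer one, so the global score is pulled down by end gaps while the semi-global score is not; this is the kind of length effect the comparison of the two alignment types is meant to detect.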
For the two types of alignment, the computer calculated the p-distance (the proportion of aligned nucleotide pairs consisting of differing nucleotides). The “sequence diameter” of a species is defined as the greatest p-distance between two samples from the species. Based on the sequence diameter, “monophyletically correct identification” of a species occurs if, for every sample in the species, no sample from another species lies within the sequence diameter. The corresponding “monophyletic probability of correct identification” (PCI) is the fraction of species correctly identified under this definition (Hollingsworth et al. 2009). The number of correctly identified species divided by the total number of species provided the PCI estimator for each marker and dataset. The Wilson score interval yielded 95% confidence intervals for each PCI estimate (Wilson, 1927).
To further visualize the differences between sequence variation within a species and variation between species, pairwise comparisons were completed for the same data set used for the PCI estimate and plotted on a graph showing the variation of the 4 genes initially chosen for this study (Fig. 3). (More details from Vincent, also a comparison for 6 genes)
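The definitions above can be made concrete with a short Python sketch. It rests on simplifying assumptions that are not part of the study: sequences are taken to be already aligned to a common length with "-" for gaps (the study aligned each pair separately), "within the sequence diameter" is read as "at a distance less than or equal to the diameter", and all function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def p_distance(x, y):
    """Proportion of aligned nucleotide pairs that differ (gap columns skipped)."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no aligned nucleotide pairs")
    return sum(a != b for a, b in pairs) / len(pairs)


def monophyletic_pci(samples):
    """Return (correctly identified species, total species).

    `samples` is a list of (species_name, aligned_sequence) tuples.
    """
    by_species = {}
    for idx, (species, _) in enumerate(samples):
        by_species.setdefault(species, []).append(idx)

    def dist(i, j):
        return p_distance(samples[i][1], samples[j][1])

    correct = 0
    for species, members in by_species.items():
        # Sequence diameter: greatest within-species distance (0 for singletons).
        diameter = max((dist(i, j) for i, j in combinations(members, 2)),
                       default=0.0)
        others = [k for k in range(len(samples)) if samples[k][0] != species]
        # Correct only if every foreign sample lies outside the diameter.
        if all(dist(i, k) > diameter for i in members for k in others):
            correct += 1
    return correct, len(by_species)


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval; z defaults to the 95% two-sided normal quantile."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy data (invented for illustration only).
samples = [
    ("sp_A", "ACGTACGT"),
    ("sp_A", "ACGTACGA"),
    ("sp_B", "TTGTACGT"),
    ("sp_C", "ACGTACGT"),  # shares its sequence with one sp_A sample
]

if __name__ == "__main__":
    correct, total = monophyletic_pci(samples)
    print(correct / total, wilson_interval(correct, total))
```

In the toy data only sp_B counts as correctly identified: sp_C has a sequence that is not unique to the species, and that same sequence falls inside the diameter of sp_A.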
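One way to tabulate such pairwise comparisons is sketched below in Python. This is an illustration under assumptions, not the procedure used for Fig. 3: sequences are assumed to be pre-aligned to a common length, differences are counted as raw numbers of differing aligned nucleotides, and the function name, binning parameter and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_difference_histograms(samples, bin_width=1):
    """Count sequence pairs per number of differences, within vs. between species.

    `samples` is a list of (species_name, aligned_sequence) tuples.
    Returns two Counters keyed by the lower edge of each difference bin.
    """
    within, between = Counter(), Counter()
    for (sp1, s1), (sp2, s2) in combinations(samples, 2):
        # Number of aligned positions that differ, ignoring gap columns.
        diffs = sum(a != b for a, b in zip(s1, s2) if a != "-" and b != "-")
        target = within if sp1 == sp2 else between
        target[(diffs // bin_width) * bin_width] += 1
    return within, between


# Toy data (invented for illustration only).
toy = [
    ("sp_A", "ACGTACGT"),
    ("sp_A", "ACGTACGA"),
    ("sp_B", "TTGTACGT"),
    ("sp_C", "ACGTACGT"),
]

if __name__ == "__main__":
    within, between = pairwise_difference_histograms(toy)
    print(dict(within), dict(between))
```

Plotting the two histograms on the same axes shows whether the within-species and between-species distributions are separated (a “barcode gap”) or overlap.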
Survey methodology or another method to express PCR success here.
Results and discussion
A cartoon of accepted fungal relationships and the lineages sampled in this study is presented in Figure 1. We did not cover Glomeromycota, as a substantial data set for that phylum already exists and amplification of RPB1 proved unsuccessful. (mention other groups not covered). Discuss variety in Fungi briefly here. From the customized database containing the submitted data we recovered 951 records with sequences present for ITS, LSU, SSU and RPB1. To allow for an efficient analysis, the isolates were divided into four sets by taxonomic affinity, which also reflects the broad user populations of mycologists. The analyses were performed in several ways, but a single selected analysis is presented below, with graphs presented for each group and a combination of all four markers. The following abbreviations were assigned to the barcode markers investigated: I (ITS), L (LSU), M (MCM7), R (RPB1), P (RPB2), and S (SSU). The four datasets investigated were: (1) “ILR Pezizomycotina”; (2) “ILR Basal Lineages”; (3) “ILRS”; and (4) “ILMRPS”. The indicated markers were successfully recovered from all samples in each dataset. The ILRS dataset included 142 species with more than one sample and 84 species with a unique sample (the number of unique samples with a sequence not unique to the species was, for each marker: ITS, 2 samples; LSU, 6 samples; RPB1, 0 samples; SSU, 2 samples). A second set of analyses (Fig. 3) largely agreed with the rankings seen in the PCI analysis.
Based on overall performance in species discrimination, SSU can be eliminated as a candidate locus (Fig. 2). While the gene was reported to have minimal problems in PCR amplification, sequencing or alignment, it had the lowest species discrimination in Pezizomycotina (Fig. 2a) and Basidiomycota (Fig. 2b) and the second lowest in Saccharomycotina (Fig. 2c). In the basal lineages (Fig. 2d), SSU showed good discriminatory power, on par with LSU and better than both ITS and RPB1. In an overall comparison (Fig. 3), SSU had almost no “barcode gap” when compared to the other genes. LSU showed levels of species discrimination between 65% and 80% for all groups except the Saccharomycotina, where it had the lowest discriminatory power of the single genes within that group (35%). For LSU, virtually no problems were encountered during amplification, sequencing, aligning and editing. It also had an improved gap between the variation seen outside and inside species when compared to SSU.
Our main choice, based on overall analysis, was between the two remaining markers. ITS had a clear distinction between the variation seen inside and outside designated species, with a peak seen near 400 base pair changes for more than 2500 sequence pairs (Fig. 3). This was bested only by RPB1, which had a peak at 720 changes for more than 1000 pairs. When PCI is compared, ITS had the most resolving power for species discrimination in the Basidiomycota (85%) but performed less well than RPB1 in the Pezizomycotina (70%). In the Saccharomycotina and the basal lineages, ITS showed lower discriminatory power than RPB1 in the former and than SSU and LSU in the latter. When taxa are combined overall, ITS PCI was second in resolving power behind RPB1 among the single genes tested. RPB1 consistently showed relatively high levels of species discrimination (comparable to the multi-gene combinations) in all the fungal groups except the basal lineages (Fig. 1). Of the single genes, RPB1 had the most resolving power for species discrimination in the Pezizomycotina (85%), but performed slightly lower than ITS and LSU in the Basidiomycota (75% versus 80% – 85%). In the Saccharomycotina, RPB1 showed the highest species discrimination (65%) of all the single genes tested in that group, but fell somewhat short of the multi-gene combinations. In the basal lineages, RPB1 showed the lowest discriminatory power (50%) of the single genes in that group.
A number of additional problems were noticed in the four-marker comparisons. Most of the lichenised Pezizomycotina in our database had to be excluded because the SSU PCR primers disproportionately amplified the algal symbiont. Therefore three genes were chosen for comparison in order to increase the inclusion of lichenised fungi. This is shown in Appendix 1. The ILR Pezizomycotina dataset (Appendix 1a) included 179 species with more than one sample and 117 species with a unique sample (the number of unique samples with a sequence not unique to the species was, for each marker: ITS, 4 samples; LSU, 6 samples; RPB1, 0 samples); and the ILR Basal Lineages dataset (Appendix 1b) included 34 species with more than one sample and 50 species with a unique sample (for each marker, every unique sample had a sequence unique to the species). No noticeable difference in ranking of the four candidate marker genes could be found. Similarly, the four-marker set for the basal lineages yielded only 43 isolates, because the RPB1 gene had a disproportionately low amplification success in this group. A three-marker set without RPB1 yielded a comparison of 152 isolates (Appendix 1b).
To test whether other single-copy protein-coding genes could act in a similar way to RPB1, two additional genes were chosen for comparison. Neither of these yielded data from the basal lineages, but a combination of the remaining groups yielded 207 isolates for which 6 genes could be compared (Appendix 2). The ILMRPS dataset included 55 species with more than one sample and 23 species with a unique sample (for each marker, every unique sample had a sequence unique to the species). In this comparison the other protein-coding genes behaved in a comparable way to RPB1, with RPB2 yielding the best results, followed by RPB1 and MCM7. One other reason we included MCM7 was to test its behavior as a potential barcode after its recent introduction as a broad phylogenetic marker.