1

SUPPLIMENTAL INFORMATION

Sample Collection and Preparation and Sequence Quality Checks.

Seawater (4 L) was collected by Niskin bottles from each depth and pre-filtered through a sterile GFA glass fiber filter (Whatman, NJ, USA), for collection of particle attached Archaea and Bacteria and microbial Eucarya, before passage through a sterile 0.22 µm HV Durapore (Millipore, MA, USA) filter, for collection of the free-living Bacteria and Archaea. Filters were immediately placed in liquid nitrogen until being transferred to a land-based laboratory, where they were stored at -80°C until extraction of community DNA using the MoBio UltraClean Soil DNA Kit (MoBio Laboratories, Inc. CA, USA). DNA extracted from both the Whatman and Durapore filters for the same depths was pooled, and ~5ng of the pooled DNA used as template in PCR’s.

Each PCR reaction was performed in 100µl volumes and contained 5U AmpliTaq Gold (Applied Biosystems, CA, USA), 1 x PCR Amplitaq Gold buffer (Applied Biosystems), 2.5 mM MgCl2, 400 µM of each deoxynucleotide and 400 nM of each primer. The reaction mixtures were held at 94°C for 10 min followed by 24 amplification cycles of 94°C for 40 sec, 55°C for 40 sec and 72°C for 40 sec, with a final step at 72 °C for 2 min. Amplification products from 3 separate reactions for each sample were pooled and purified in a MinElute PCR purification kit (Qiagen, CA, USA).

A 3 base pair (bp) key code, in which each key differs by at least 2 bases, was added to the forward primer sequence to enable subsequent bioinformatic identification of sequences originating from each depth. We did use two homopolymers (AAA, TTT) as key codes to identify sequences belonging to different samples. It has been reported that 454 pyrosequencing is susceptible to higher sequencing error rates associated with homopolymers, or consecutive bases such as AAA. Although the use of these codes may have resulted in increased error at this point of the sequence we believe it has not impacted on the data itself. As all primer and key codes were removed from quality checked sequences before analysis the homopolymer codes themselves do not form part of the data. Furthermore, we do not believe their use resulted in less sequences passing the quality control measures, as we identified sequences that had perfect matches to the key code followed by the primer. For example, in an instance where homopolymer sequencing error resulted in AAAA instead of AAA in the barcode region, our quality control search would just identify the last 3 A’s followed by a perfect primer sequence, thus the sequencing error would have no effect.

Three sets of reactions were run for each depth with the modified primers. Purified PCR products were pooled (10 µl from the 10m sample, and 5 µl from 800m and 4400 m samples) and sent to 454 Life Sciences for pyrosequencing in a GS-FLX sequencer. This machine returns an average sequence length of 240 base pairs (bp). Since the average length of our assay fragment including primers was 122 bp, our assay was not affected by the decrease in quality of sequencing towards the maximum read length of the machine. Data returned from 454 Life Sciences was subject to a systematic check to remove low quality reads. We eliminated (i) sequences that did not perfectly match the 3 bp key code and PCR primer at the beginning of the read; (ii) sequences that did not perfectly match at least the first 10 bp of the distal primer; (iii) sequences that contained any undetermined nucleotides (N); and (iv) sequence reads of <50bp after the removal of the proximal and distal primers. Use of these quality control measures greatly increases resultant read accuracy in pyrosequencing assays (Huse et al., 2007). Quality checked sequences were separated into groups on the basis of their sample depth using their 3 bp key code. Both the proximal and distal primers were also removed from the sequences before further analysis. Unique sequences and their total abundances were identified (using the Unix shell command “sort | uniq –c”), and each unique sequence was labelled thus:

HOT169_(depth of origin)_(rank abundance)_(total abundance)

e.g., the unique sequence found in the sample from 10m which occurred the most times, 22 372, is labelled HOT169_10m_1_22372.

Sequence Analysis

Sequences were submitted to GenBank at the National Center for Biotechnology Information (NCBI) for BLAST analysis, with searches performed with an e-value cutoff of 10-5 (All GenBank+EMBL+DDBJ+PDB sequences, August 25, 2007, 5 826 003 sequences in database. To assign a Domain affiliation to each query sequence, the best high scoring pair (HSP) from the resultant BLAST search was compared against the NCBI taxonomy database. The Domain of origin of the best HSP was taken as the domain of origin of the query sequence. The data was separated into subsets (i.e., Eucarya, Bacteria, Archaea) according to these associations. Two databases were created for each domain at each depth, the “all sequence”, composed of all sequences, and the “unique sequence” database, composed of representative unique sequences. DNA sequences from each unique HSP were retrieved from GenBank. Where possible, phylogenetic affiliations were parsed from GenBank entries. Further steps were taken to taxonomically identify HSPs with no phylogenetic classification. For Bacteria and Archaea, HSP sequences were aligned and classified using tools on the Greengenes website (DeSantis et al., 2006) ( NCBI taxonomy was used to identify Bacteria affiliations and Hugenholtz taxonomy for Archaea (in order to preserve the commonly used marine group definitions in Archaea). These steps were repeated to generate phylogenetic affiliations for 16S rDNA tag data from Sogin et al., (2006), 16S rDNA clone library data generated from a depth profile at Station ALOHA (DeLong et al., 2006), and parsed 16S rDNA sequences from the Global Ocean Survey datasets (Rusch et al., 2007). Eucarya HSP sequences were manually aligned in ARB ( and incorporated into a tree constructed from 8 467 Eucarya 18S rDNA sequences using parsimony methods. The source of origin of clones was determined from GenBank submission or manuscripts relating to GenBank submissions. The percentage identity tag sequences displayed when compared to GenBank HSPs were parsed from BLAST outputs. Theoretically this analysis may be skewed to high percentage similarities reported by short BLAST hits, but examination of the results shows that skewing has not impacted on the large majority of comparisons. Over 90% of BLAST hits encompassed the full length of the amplified fragment, and those that didn’t accounted for only 10% of the total diversity and 5.1% of the total abundance.

Diversity analysis

DOTUR analysis

Unique sequence databases were aligned using MUSCLE v3.6 (Edgar 2004) with parameters –maxiters =2 and –diags. Pairwise distances were calculated for each alignment using default parameters implemented in the DNADIST program of the PHYLIP package (Felsenstein 1993). These distance matrices were used as input into the program DOTUR (Schloss and Handelsman 2005), which was used to compute rarefaction curves at different levels of phylogenetic similarity.

Diversity estimation

Clustered data from CD-HIT was used to calculate parametric and non-parametric richness estimates and richness lower bounds for a variety of phylogenetic cut-offs. ). For each sample, we converted observed OTU abundance counts to frequency counts f1, f2, f3, …, where fi = the number of OTU’s represented i times in the sample.

There are four main families of methods for estimating total OTU richness based on frequency count data: parametric maximum likelihood (Bunge and Barger 2008), nonparametric coverage-based (Chao, 2005), nonparametric maximum likelihood (Böhning and Schöon 2005), and parametric Bayesian (Barger and Bunge 2008). Here we apply and compare the first two, as the latter two are undergoing rapid development and software implementations in particular are not yet readily available.

In the parametric maximum likelihood approach, we postulate a parametric stochastic abundance distribution for the (unknown) OTU population abundances. This yields (mathematically) a postulated parametric distribution for the observed frequency counts; which we then fit to the data by maximum likelihood; the fitted distribution in turn yields an estimate of f0 (the number of unobserved OTU’s) and hence of the total richness, and associated standard errors, goodness-of-fit assessments, etc. We tested 7 candidate abundance distributions: the single point mass or equal abundances model; the gamma, inverse Gaussian, lognormal, and Pareto; and mixtures of two and of three exponential distributions. Due to the large size and high diversity of the data analyzed here, the mixtures of two and three exponential models usually provided the best fit, as expected (Bunge and Barger 2008). Note that frequency count data of this type typically displays a large number of rare OTU’s and a small number of highly abundant OTU’s, resulting in the curve shown in Figure S9, which displays the data for Eucarya sampled at 4 400 m, up to frequency = 100 (the actual maximum frequency is 3 618).

For some subsets of data, no known parametric distribution fit the complete dataset, and in such cases we split the data into a lower-frequency component, consisting of the frequency counts up to a right truncation point, denoted τ, i.e., f1, f2, …, fτ, and a higher-frequency component fτ+1, f τ+2, fmax, where max denotes the maximum frequency observed in the data. We then fit the model to the lower-frequency component, and add in the higher-frequency counts to obtain the final estimate of total richness. We fit each of the 7 candidate models at every right truncation point, using custom software written in Maple® (available from author John Bunge) and parallelizing the computations on a supercomputer, and select the “best of the best,” assessing goodness-of-fit via six variants of the Pearson chi-square statistic and two variants of the Akaike Information Criterion (AIC). We seek a combination of high τ, so as to fit the parametric model to as much of the data as possible, and good fit. In this example τ = 66.

The nonparametric coverage-based approach first estimates the sample coverage – the proportion of the population represented in the observed data – nonparametrically, and on this basis makes a preliminary richness estimate, which is then adjusted upward (again nonparametrically) based on the coefficient of variation of the data. This is the Abundance-based Coverage Estimator or ACE, and it has two versions: ACE and a high-diversity variant ACE1 (which involves a higher upward adjustment); the latter is selected when the coefficient of variation exceeds 0.8 (Chao, 2005), which was the case for every dataset analyzed here. We computed these estimates using the software SPADE (Shen et al., 2003). These procedures also require a right truncation point τ. The default value of τ recommended for ACE and ACE1 is 10. In high-diversity data these estimators may behave erratically for high values of τ: Figure S10 compares the values of the parametric estimates, here based on the mixture-of-three-exponentials abundance distribution, to ACE and ACE1, with associated 95% confidence bands (+/- 1.96*SE), as τ increases from 10 to 100, again for the data from Eucarya sampled at 4400m.

Note that (i) the parametric confidence intervals are reasonably self-consistent, and stabilize after an initial period of large error (due to random fluctuations in the frequency count data); (ii) while ACE1 is selected in every case by the coefficient-of-variation criterion, it drifts to infinity as τ increases, although at the (default) recommended τ = 10 its confidence interval (27 875, 29 068) almost overlaps the parametric interval (29 671, 45 821) at the parametrically selected τ = 66, and in fact its confidence intervals overlap the parametric intervals for τ from 12 to 27; (iii) while ACE is not selected by the coefficient-of-variation criterion, its confidence intervals overlap the parametric intervals for τ from 17 to 66. We have little guidance for selecting an optimal τ for the nonparametric estimators, and they differ greatly depending on τ, while for the parametric procedure we can select τ based on goodness-of-fit assessments, and the parametric estimates are reasonably consistent as τ varies. For this reason (among others) we prefer the parametric method for high-diversity data of this type, but we report and compare the parametric results at “optimal” τ along with the nonparametric results at τ = 10.

Primer Analysis

The phylogenetic breadth of the primer set was analysed in-silico by comparison to the SILVA (( ribosomal RNA database that contains representatives of the Archaea, Bacteria and Eucarya. Firstly, we took a subset of the SILVA database that contained all sequences >1450 bp in length, as this was the most likely subset to contain both our primers, which are situated at the 3’ end of the small subunit rRNA gene. Importantly however we could not be certain the primer sites were actually contained in these sequences. Primers sites used for sequence amplification are often (rightly) removed prior to data submission. In total 119 794 sequences were analysed. Taking into account the Tm of the primers and the annealing temperatures used in this assay, we ascertained that any sequence with specificity to the 3’ 13bp (of 14) of the forward primer and the 3’15bp (of 18) of the reverse primer would be considered as a match. This of course would be a conservative estimate as primers need not match perfectly across their entire length to efficiently amplify products (Sommer and Tautz 1989; Kwok et al., 1990). We searched for identical matching primer sites among these sequences using the Unix command ‘grep’. A total of 72 533 sequences contained perfect matches to both our forward and reverse primers. The taxonomic breakdown of this analysis is provided in Supplementary Table S4. Classification of taxa the primers would be expected to amplify based on hierarchical taxonomy used by NCBI indicate that at the 3rd rank (generally equivalent to Class), 4th rank (~ Order), 5th rank (~Family), and 6th rank (~ Genus) the primers match representatives of 84%, 84%, 83%, 85% of Archaea, 83%, 84%, 84%, 80% of Bacteria, and 90%, 83%, 82%, 80% of Eucarya (Note the NCBI Eucarya taxonomy contains many intermediate “no rank” categories such that 8th rank may be equivalent to any rank from class to genus). Thus our analysis reveals the primer set used to amplify ribosomal tags in this study would be expected to amplify fragments from > 80 % of known phylogenetic diversity across all 3 Domains (as represented by near full length sequences in the SILVA database). Indeed our results show they also amplify fragments from a large number of previously unknown phylogenetic lineages as well.

Comparison to alternative Bacteria and Eucarya primer sets

To determine whether our primer set missed significant sub-sets of the actual microbial population in our samples, (as opposed to known total known microbial diversity) we generated and sequenced clone libraries for Bacteria and Eucarya from the same DNA samples as used in the study, but using alternative primer sets. Lack of suitable primers precluded the production of Archaea libraries. The primer sets used may have their own biases but libraries were designed to encompass the stretch of ribosomal DNA that includes the original primer sites used in this study. This would enable the identification of primer mismatches.

The Bacteria assays utilized universal primer 1392F (as used in this study) and Bacteria specific primer 23Sr 5’-GGGTTBCCCCATTCRG-3’ (Fisher and Triplett 1999). This assay amplifies the 3’ region of the 16S rRNA gene sequence and the internal transcribed spacer region. The Eucarya assay utilized Eucarya specific primers 18Sf 5'-ACCTGGTTGATCCTGCCAG-3' and 18Sr 5'-TGATCCTTCYGC AGGTTCAC-3' (Moon-van der Staay et al., 2001) which amplify near full length 18S rRNA gene fragments. In total we performed bi-directional sequencing on 241 16S/ITS clones (87 from 10 m, 85 from 800 m and 69 from 4 400m) and 119 near-full length 18S rRNA clones (22 from 10 m, 36 from 800 m and 61 from 4 400 m). We excised the region corresponding to our V9 assay (including removal of primer sites), grouped sequences based on 100% identity and BLAST these sequences against our tag sequence database. Results of this analysis are displayed in Table S5. In total only 21 of the 361 sequences did not match perfectly one of our tag sequences. Furthermore, of these 21, 16 contained only a single mismatch. This single mismatch could be the result of sequencing error and would be accounted for in the grouping of our tag abundance data diversity analyses. The 5 sequences generated by alternative primer sets which contained more than 1 bp mismatch to any tag sequence were all from the Eucarya and included 2 sequences from 10m which contained 5 and 12 mismatches to closest tag sequences, 1 from 800 m which contained 9 mismatches to its closest tag sequence, and 2 from 4400m which contained 2 and 5 mismatches to closest tag sequences (Table S5). We also manually examined the primer sites corresponding to our V9 primers to identify any mismatches in the independently amplified sequences. We were able to determine that 9 other clones had mismatches to either the forward or reverse primer, however all these clones contained perfect matches in our tag sequence database, again indicating that a perfect match is not required for efficient amplification, especially when the mismatch is located near the 5’ end of the primer (Sommer and Tautz 1989; Kwok et al., 1990) and melting temperature specificities are taken into account.