Construction of RNA-Seq Libraries

Methods

RNA isolation

Animals were synchronized by treating a mixed population with bleach to collect embryos. The embryos were incubated without food, causing the larvae to arrest at the L1 stage upon hatching. Starved L1 larvae were plated in the presence of food and raised at 25 degrees Celsius. Staging was determined by examining a subset of animals under Nomarski optics at intervals for vulval and germline development. Animals were collected at the mid-L2, mid-L3, mid-L4, and young adult (pre-gravid) stages and washed in S basal buffer. Exact times for growth after plating of starved L1s (all at 25 degrees C) were as follows: (1) L2 worms were grown for 17.75 hours (Pn.p cells visible but not divided, gonad just starting to proliferate); (2) L3 for 26.75 hours (Pn.p cells divided once or twice; gonad just starting to turn up); (3) L4 for 34.25 hours (vulvae are in Christmas tree stage gonad has passed bend, sperm are present); and (4) Young Adult for 46 hours (vulvae fully formed and oocytes present in gonad, but no embryos). The worms were centrifuged on a sucrose cushion to remove debris and recovered in several washes of S basal. Total RNA isolation was performed by adding 4 volumes of Trizol (Gibco) per volume of packed worms. After two rounds of freeze/thaw and vortexing, the slurry was extracted with 2 volumes of chloroform. The aqueous layer was precipitated with one volume of isopropanol and the pellet washed with 70% ethanol. The pellet was resuspended in water and the concentration determined by UV spectrometry. RNA quality was assessed by examining an aliquot run on an ethidium bromide stained gel for discrete ribosomal and transfer RNA bands, as well as an mRNA "smear".

Construction of RNA-seq libraries

For each library, a 10µg aliquot of DNaseI treated total RNA was separated into polyA+ and polyA- fractions using the MACSTM mRNA Isolation Kit (Miltenyi Biotec, Bergisch Gladbach, Germany). Double-stranded cDNAs were made from the polyA+ fractions using the Superscript Double-Stranded cDNA Synthesis Kit (Invitrogen, Carlsbad, California, USA) and random hexamer primers (Invitrogen, 25 µM). The quality and quantity of the resulting double-stranded cDNAs was assessed using an Agilent DNA 1000 series II assay (Agilent Technologies, Santa Clara, CA, USA) and Nanodrop 7500 spectrophotometer (Nanodrop, Wilmington, DE, USA). Aliquots containing ~200 ng amplified cDNA were each diluted with water to 100uL and sonicated for 5 minutes. Sonication was performed in an ice water bath using a Sonic DismembratorTM 550 (Fisher, Ottawa, Ontario, Canada) with a power setting of "7" in pulses of 30 seconds interspersed with 30 seconds of cooling. Total sonication times refer to active sonication time only and do not include the rest periods in between each pulse.

The resultant cDNAs from all four libraries were size-fractionated on 8% polyacrylamide gels, and the 100 to 300 bp fractions excised. Gel-purified cDNA products were modified for Illumina sequencing using the Illumina genomic DNA sequencing kit as follows: Size-selected cDNAs were end-repaired by T4 DNA polymerase and Klenow DNA Polymerase, and phosphorylated by T4 polynucleotide kinase. The cDNA products were incubated with Klenow DNA Polymerase to generate 3' Adenine overhangs followed by ligation to Illumina adapters, which contain 5' Thymine overhangs. The adapter-ligated products were purified on Qiaquick spin columns (Qiagen), then PCR-amplified with PhusionTM DNA Polymerase using Illumina's genomic DNA primer set. PCR products were purified on Qiaquick MinEluteTM columns and the DNA quality assessed and quantified using an Agilent DNA 1000 series II assay and Nanodrop 7500 spectrophotometer and diluted to 10nM. Cluster generation and sequencing was performed on the Illumina cluster station and 1G analyzer following manufacturer's instructions. Sequences were extracted from the resulting image files using the Firecrest and Bustard applications run with default parameters. Read lengths were 36 bases.

Databases for sequence alignment

Summary:

Reads were aligned to the following databases:

- WS170 C. elegans genome

- a nonredundant splice junction database composed of

* predicted splice junctions including WormBase, Twinscan and Genefinder predictions, termed “ss170”

* alternative splice junctions created by novel combinations of those donors/acceptors in ss170, termed “ss170c”

* novel splice junctions created from all combinations of donors and acceptors predicted by a permissive Genefinder run, “ps170”

- splice leader database created by prepending all SL sequences to a nonredundant set of

* all predicted acceptors (WormBase, Twinscan, Genefinder)

* acceptors predicted by Genefinder run in permissive mode

Details:

Because the sequence reads are derived ultimately from the processed RNA transcripts, we created databases to reflect these processing events as well as the genome. The databases against which the reads were aligned consisted of the following:

Genomic DNA. We used the WS170 version of the C. elegans genome (Rogers et al. 2008). To distinguish unique regions from repeated sequences, we assigned each base a representation value as calculated by the number of 24mers from the genome that base participates in, considering all 24mers from both strands. Unique sequence has a representation value of 48, with participation in two unique 24mers, one from each strand at that position.
Predicted splice junctions. A non-redundant database was created of 70mers centered on all predicted splice junctions called “ss170”. These predicted splice junctions included those in the current WormBase gene models as well as those in predicted pseudogenes (2,089 total; 494 are confirmed by EST data) and those annotated as non-coding transcripts (395 total; 347 are confirmed by EST data); those predicted by Twinscan (Korf et al. 2001); and those predicted by Genefinder (P. Green, unpublished) using relaxed parameters designed to minimize the number of missed genes.
Alternative splice junctions. To look for novel combinations of the splice sites in the gene models above, we combined those donors/acceptors that were >29bp and <4kb apart and added those to ss170 to create a second database, termed ss170c.
Novel splice junctions. To look for splice sites that might not be incorporated into any of the three gene sets above, we used Genefinder (genefinder1.1) in a very permissive mode (-minintronlength 1 -minexonlength 1 -minorflength 10 -orfcutoff -0.5 -intron3cutoff -0.5 -intron5cutoff -0.5 -tablenamefile nemtables) as above to identify a large number of possible donors and acceptors. Combining all in-frame donors and acceptors in the set that are >39bp and <2000bp apart, we generated 84,386,278 possible splice combinations (98% of all WormBase annotated introns are >=40bp and <=2kb; 99% are <=4kb; 99.5% are <=5kb). Subsets of these possible splice junctions were used for each stage, based on above threshold scoring of the exons flanking the possible donors and acceptors to create a database of candidate novel splice junctions, ps170, for each stage.
Comprehensive nonredundant splice database. A nonredundant splice junction database was created from the ps170 and ss170c databases called ps170ss170c.
Splice leader database. To mimic sequences that would be created by appending splice leader sequences at the start of mRNAs, we created the list of all possible acceptor sites in the C. elegans genome (2,548,064), using GENEFINDER (genefinder-1.1) with permissive parameter settings (-minintronlength 1 -minexonlength 1 -minorflength 10 -orfcutoff -0.5 -intron3cutoff -0.5 -intron5cutoff -0.5 -tablenamefile nemtables; 3% of wb170 predicted splices are not identified using these parameters). We then added in any acceptors from currently predicted introns in WormBase (Genefinder + WormBase + Twinscan = 133,777 predicted, 8,725 confirmed) not already included in the above list and prepended each of the known splice leaders (SL1, SL2 … SL12) independently to all acceptors to create the splice leader databases (sl170sl170p) (Table 7).

Alignment of Illumina reads against the C. elegans genome and databases

Summary:

Reads aligned to the following databases:

- Genome

* MAQ starting with full 36mer, iteratively trimming by 2 bases on the 3’ end down to 24mers only retaining alignments with <=1 mismatch

* cross_match retaining all matches with score >=22

- Splice junction and splice leader databases

* cross_match retaining best in database matches, score >=24, <=2 mismatches

- Reads with polyT tails

* cross_match genome alignments were parsed to identify reads with at least 4 initial Ts that did not align to the genome

Details:

We used MAQ (Li et al. 2008) and cross_match (P. Green, unpublished) to align the sequences to the genome and the various databases, keeping only one alignment for each read. MAQ is fast and computes a mapping quality score related to the uniqueness and quality of the alignment. However, it retains only a single matching location for any read with multiple equally reasonable alignments in the genome. Cross_match was modified for these short reads to increase speed (P. Green, unpublished) but, with the parameter settings we used for both programs, required more cpu time than MAQ. However, it provided information about sites of multiple matches and was more flexible. We describe below how each was used against the various databases.

Genome. Using MAQ (version 0.6.0, parameters: –n 1) in an iterative fashion, initially all reads were searched against the C. elegans genome, retaining only matches for those reads aligning with <=1 mismatches. For those reads without an alignment with <=1 mismatches, two bases were trimmed from the 3’ end of the read and the resulting trimmed read was searched back against the C. elegans genome using MAQ. Reads aligned with <=1 mismatches were again retained as matches and for those without “matches”, two bases were trimmed from the 3’ end and searched against the genome etc. until the read had been trimmed to 24 bases. Only those reads with at least 24 bases were retained as matches. Reads with a mapping quality score of >=10 were termed “high quality” mapped reads. Less than 3% of those reads placed with mapping qualities >=10 were determined to have more than 1 alternative placement in the genome. All reads were also aligned to the genome using cross_match (version 1.080426, -minscore 22 -gap1_only -gap_init -10 -gap_ext -10 -masklevel 101). With these parameters, cross_match finds local alignments (another difference from MAQ). However, cross_match can also identify global alignments using the –globality parameter.
Alignments to splice junction and spliced leader databases. To identify valid alignments against the novel and currently predicted splice junction database (ps170ss170c) and splice leader database (sl170sl170p) (identical methods were used for alignments to the artificial splice junction and splice leader databases described below), all reads were aligned using cross_match (version 1.080426, -minscore 22 -gap1_only -gap_init -10 -gap_ext -10 -masklevel 101). For alignments to be kept, the alignment must start at base 1 or 2 of the read, contain <=1 discrepancy in the first 24 bp, <=2 total mismatches in the aligned portion of the read and a cross_match score >=24. Additionally, alignments to the splice junction database were only kept for reads where that read had as its best in database match a score of >=2 more than its next best-in-database match and >=2 better than its next best-in-genome match. For each SL or splice junction alignment, the “score” for that read against the splice junction or SL is the difference between the best in SL/splice junction database alignment cross_match score and the best in genome score. However, if there is no alignment to the genome for a particular read, then the “score” is the amount of overlap into the SL or across the splice junction (the “shorter” side of the alignment).
Aligning reads with polyT tails.To produce an initial list of possible polyA tails, we identified all reads beginning with at least 4 Ts where, based on the cross_match alignments, the portion of the read following the Ts was aligned to the genome and the initial Ts did not match the genome at that location. We required <=3 discrepancies and a cross_match score of >=24 to retain an alignment for consideration. For all reads with at least 4 initial Ts, we identified the best in genome alignment (requiring the best-in-genome alignment have a cross_match score >=2 better than the next best alignment) and ask in that best-in-genome alignment if at least 4 of the initial Ts were not found in that genome alignment. For those reads meeting those criteria, the best-in-genome match is then compared to the best-in-splice junction and best-in-splice leader alignments. The best-in-genome match is retained only when its score is better than the score found in searching against either of those databases.

Defining final placement for each read

Summary:

Initial coverage per genome base is estimated using results of the alignments using MAQ to identify the best in database placement per read:

- combine polyT-trimmed searches with sl170 searches retaining the longer alignment per read

- combine polyT/sl170 placement with the best in genome placement retaining longer alignment per read

Final database placement per read is determined after cross_match alignments to all databases with the read placed in the best in genome alignment position with the following exceptions:

- polyT alignment is kept if it is better than best in genome, best in splice junction and best in splice leader database

- splice leader alignment is kept if it is significantly better than the best in genome and best in splice junction database

- splice junction alignment is kept if it is best in the splice junction database and better than the best in genome alignment

Details:

After MAQ has been run against the various databases, we create an initial coverage estimate per genome base by combining the various database searches in the following way. First, we combined the results of the polyT-trimmed searches and those searches against the trans-spliced leader/predicted acceptor database (sl170). For each read, the longer alignment was kept whether it was to the sl170 database or with the polyT trimmed with only one exception: because the 3’ end of the SL1 sequence (TTTGAG) contains a polyT, in the case when there was a perfect match against the genome after trimming the polyT, but a single mismatch against the sl170 database, the polyT match was retained. Second, we combined these results with the alignments to the genome (down through the 24mer alignments). For each read with an alignment to the sl170 database, the sl170 alignment was retained if and only if there was no alignment to the genome or the length of the alignment to the sl170 database equaled the length of the alignment to the genome, but there was a one base mismatch in the alignment to the genome. For each read with an alignment to the genome after trimming the polyTs, the alignment the polyT-trimmed alignment was kept only if it was longer than the alignment of the read to the genome. Third, we combined these results with the results of the alignments against the splice junction database (ss170). If there was an alignment to both the sl170 and ss170 databases, the sl170 alignment was only kept if the alignments were of equal length and there were zero mismatches against the sl170 database and one mismatch against the ss170 database. When there was an alignment both to the ss170 database and to the genome, the ss170 alignment was only retained when there were at least two basepairs more matching in the splice junction database than in the genome. When there was an alignment both to the ss170 database and to the genome after polyT trimming, the polyT trimmed alignment location was only kept if it was longer than the alignment to the splice junction database. Using the coverage per genome base, we create the coverage-based ps170 database per stage as described above.

After all cross_match searches have been run we obtain the final placement per read by combining the results of these various database searches in order to retain the best placement for each read. We assume a read is placed in the genome with the following exceptions. As noted, for a read to be categorized as identifying a polyT it had to have a score of >=2 better than the next best in polyT database alignment and a score of >=2 better than the best hit to the genome. Similarly, we required the score of the alignment to the polyT database have a score of >=2 better than the best alignments to the splice leader and splice junction databases. For a read to be categorized as a valid alignment to the splice junction database, as noted it had to have a score >=2 better than the next best in splice junction database alignment, and a score of >=2 better than the best in spliced leader and best in genome alignment. Finally, for a read to be categorized as a spliced leader, the score of the best alignment to the splice leader database has to be >=5 better than the next best score, >-=5 better than the next best in genome score, and >=5 better than the best splice junction score.The larger score for splice leader assignment eliminated false positives arising from highly expressed genes, perhaps stemming from the fact that the consensus sequence preference for the last three bases of the upstream exon preceding a donor site is RAG, which is the same as for the TSL. Reads with ambiguous placement, that is, reads not exceeding one of the scores above were not placed.

Estimating false discovery rates and false positive rates

Summary:

- genome alignments