Aedes aegypti Transposable Element Annotation Document

Definition of “copy”, “element”, and “family”

Within a genome, a TE consists of multiple copies generated by transposition events. A collection of these copies is an element, and a family refers to a group of related elements, in the same or diverse organisms, that usually share similar structure or conserved amino acid sequences.

TEfam1 Statistics

This version of TEfam contains 1091 entries from analysis of the Aedes aegypti genome sequence (GenBank AAGE02000000). The 1091 entries include 4 tandem repeats which are not TEs, 12 unclassified interspersed repeats, and 1075 TE sequences. The 1075 TE sequences comprise 1062 Elements, 13 of which include subgroups (1062+13=1075). A small number of the TEs are “composite” sequences, which are marked as such. The naming of the hAT transposons did not follow the standard convention used in TEfam.

Aedes aegypti TE annotation Methods

Two complementary approaches were employed to annotate TEs. TEs were uncovered on the basis of TE structural features and sequence similarity to known TEs, using approaches similar to those previously reported (Tu, 2001; Biedler and Tu, 2003; McCarthy and McDonald, 2003; Arensburger et al., 2005; Tubio et al., 2005). A newly developed program, RepeatScout (Price et al., 2005), was also used to identify potential TE sequences on the basis of their repetitive nature in the genome. TEs uncovered using the above approaches were manually inspected, classified, and deposited in TEfam, a relational database for submission, retrieval, and analysis of TEs.

Sequences that showed at least 75% identity at the nucleotide level are considered as belonging to the same element within most families of the class II DNA-mediated TEs. Criteria for defining elements within the highly diverse class I TEs varied, and they were similar to the criteria previously selected during An. gambiae TE analysis. Sequences of the Ty3/gypsy LTR retrotransposons are considered as belonging to the same element if they share at least 85% nucleotide identity along at least 400 bases in their coding region. Ty1/copia sequences that share at least 85% identity at the nucleotide level over at least 1000 bp are considered as belonging to the same element. Copies of Pao/Bel retrotransposons are considered as belonging to the same element if they show at least 70% identity at the nucleotide level in their coding sequences. Sequences of non-LTR retrotransposons (as well as Penelope-like elements) are considered as belonging to the same element if they share at least 70% nucleotide identity along the reverse transcriptase region. Two of the tRNA-related SINEs, Feilai-A and Feilai-B, are considered different elements because their 5’ ends are distinct and they likely originate from different retrotransposition events.
The varied criteria (% sequence identity) used to define an element within a family or subclass reflect the different evolutionary dynamics of these different TEs and our effort to be consistent with previous analysis of An. gambiae TEs. Detailed annotation methods are described below.
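
The element-definition thresholds above can be collected in a small lookup table. The following is only an illustrative sketch in Python, not part of the original annotation pipeline: the names Criterion, CRITERIA, and same_element are hypothetical, the class II rule is stated in the text as holding for "most families" only, and a minimum aligned length is recorded only where the text gives one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    min_identity: float           # minimum percent nucleotide identity
    min_length_bp: Optional[int]  # minimum aligned length; None if the text gives none
    region: str                   # region over which identity is measured


# Thresholds transcribed from the text; the dictionary keys are illustrative labels.
CRITERIA = {
    "class II": Criterion(75.0, None, "nucleotide level (most families)"),
    "Ty3/gypsy": Criterion(85.0, 400, "coding region"),
    "Ty1/copia": Criterion(85.0, 1000, "nucleotide level"),
    "Pao/Bel": Criterion(70.0, None, "coding sequences"),
    "non-LTR": Criterion(70.0, None, "reverse transcriptase region"),  # also Penelope-like
}


def same_element(group: str, identity_pct: float, aligned_bp: int) -> bool:
    """Return True if two sequences meet the threshold for the given TE group."""
    c = CRITERIA[group]
    if identity_pct < c.min_identity:
        return False
    return c.min_length_bp is None or aligned_bp >= c.min_length_bp
```

For example, under these assumptions two Ty3/gypsy sequences with 86% identity over 450 bp of coding sequence would be assigned to the same element, whereas the same identity over 300 bp would not.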

LTR-retrotransposons.

Ty3/gypsy

1. TBLASTn was used to search the Aedes genome for sequences homologous to the pol region of representative elements belonging to the Ty3/gypsy group. The query sequences included at least two divergent elements from each lineage of the Ty3/gypsy group in Anopheles gambiae (from Tubío et al., 2005), and the elements Osvaldo (from Drosophila buzzatii), Ty3 (from Saccharomyces cerevisiae), Cyclops (from Vicia faba) and Cer1 (from Caenorhabditis elegans).

2. Hits showing at least 30% amino acid identity over at least 80% of the length of the query sequence were subjected to further analysis, in which both LTRs of each sequence were identified by means of Blast2. This allowed us to obtain putative representative sequences for several new putative elements.

3. Additional Blastn searches were performed using as queries the Aedes aegypti representative sequences identified in step 2. This allowed us to assess the representation of each element in the genome, and to obtain consensus or reconstructed sequences for those elements for which it was impossible to obtain a good representative sequence in step 2.

4. Finally, taking into account previous work on the Ty3/gypsy group, we considered different sequences as belonging to the same element if they share at least 85% identity at the nucleotide level along at least 400 base pairs in their coding sequences.
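
The hit filter in step 2 can be sketched as follows. This is a minimal illustration in Python, assuming each TBLASTn hit has already been reduced to a single record; the Hit fields and function names are hypothetical, and coverage is computed from one alignment span on the query, which simplifies the case of hits split across several segments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    query_id: str
    query_len: int       # length of the protein query, in amino acids
    q_start: int         # 1-based, inclusive start of the alignment on the query
    q_end: int           # 1-based, inclusive end of the alignment on the query
    pct_identity: float  # percent amino acid identity


def passes_step2(hit: Hit, min_identity: float = 30.0, min_coverage: float = 0.80) -> bool:
    """Keep hits with >= 30% identity over >= 80% of the query length."""
    covered = hit.q_end - hit.q_start + 1
    return hit.pct_identity >= min_identity and covered / hit.query_len >= min_coverage


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Return only the hits that would go on to LTR identification."""
    return [h for h in hits if passes_step2(h)]
```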

Ty1/copia

RT domain sequences from 4 Ty1/copia proteins of An. gambiae and Drosophila melanogaster, and the previously identified zebedee1 element from Ae. aegypti, were used as queries in a tblastn search (e-value cutoff = 1e-3) against the Ae. aegypti genome assembly. The sequences of all the hits, plus 5 kb of 5’ and 3’ flanking sequence, were retrieved to form a sub-database of the Ty1/copia family. Full-length elements were identified using the LTR_STRUC software (McCarthy and McDonald, 2003). In parallel, TEpipe (Biedler and Tu, 2003) was used to search the sub-database and obtain representative protein sequences for each element. All representatives were then used for phylogenetic analysis. An additional blastn search was performed using those representatives to retrieve all the sequences of each element; sequences sharing at least 85% identity at the nucleotide level over at least 1000 bp were considered as belonging to the same element. ORFs were determined using ORF Finder from NCBI.
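
Retrieving a hit together with its flanking sequence amounts to widening the hit coordinates and clamping them to the contig. The sketch below is an assumption about how this could be done, not the code actually used; the function name and the 1-based inclusive coordinate convention are illustrative choices.

```python
from typing import Tuple


def flanked_window(hit_start: int, hit_end: int, contig_len: int,
                   flank: int = 5000) -> Tuple[int, int]:
    """Return 1-based inclusive coordinates of a hit plus flanking sequence.

    Hits on the minus strand are reported with start > end, so the
    coordinates are sorted first. The window is clamped to the contig.
    """
    lo, hi = sorted((hit_start, hit_end))
    return max(1, lo - flank), min(contig_len, hi + flank)
```

The same function with flank=10000 would correspond to the 10 kb flanks used for the non-LTR database.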

Pao-Bel

1) Initially, tBLASTn searches were performed using as queries the first 4 domains of RT from Pao-Bel elements identified in Anopheles gambiae (a total of 11 elements taken from Repbase).
2) The sequences of the first 500 hits from each search were aligned using ClustalX, and a phylogenetic tree was constructed from this alignment by the Neighbor-Joining method using MEGA2. This allowed us to identify numerous Pao-Bel elements from Aedes aegypti.
3) Full-length elements were characterized using the LTR_STRUC software, confirming the result by visual inspection; or by identification of the two LTRs using blast2seq, and the ORFs using BioEdit.
4) Subsequent tBLASTn searches were done using as queries the amino acid sequences of the full ORF of several of these elements, chosen on the basis of their position in the phylogenetic tree. We considered different insertions as belonging to the same element if they show at least 70% identity at the nucleotide level in their coding sequences.

Non-LTR retrotransposons in the Aedes aegypti genome

1) Non-LTR database formation

tBLASTn was performed on the contig version of the genome sequence using the reverse transcriptase (RT) region (approximately 230 aa) from representatives of 18 established non-LTR retrotransposon clades. Hits with an e-value of 1e-3 or lower (that is, at least that significant) were retrieved with 10 kb of 5’ and 3’ flanking sequence and written to a file, the non-LTR database. BLAST, RepeatScout, RepeatMasker, and various computational tools developed in the Tu lab were used in combination for discovery and identification of non-LTR elements.

2) Discovery of non-LTR elements

RepeatScout was run on the non-LTR database to find individual non-LTR repeat families. TEpipe, an automated program developed in the Tu lab, was also used to survey the representation and copy numbers of non-LTR repeats in the genome. Representative copies were used for BLASTn against the genome sequence to retrieve multiple potentially full-length copies. Full-length copies, target-site duplications, and other characteristics were identified by visual inspection after alignment with ClustalW and viewing with ClustalX. After several families were identified, RepeatMasker was used to mask the database and allow verification of the remaining non-LTRs. All described non-LTRs were submitted to the TEfam Transposable Element Database. The initial non-LTR database described above (but with the RT region only) was masked using defined non-LTR elements for all sequences with 70% or greater identity. tBLASTn of this masked database using the same queries and e-value as above showed that less than 0.5% remained. The discovery method used here will miss non-LTR elements that do not have an RT domain.

3) Family definition

A non-LTR family is generally defined as all copies having 70% or greater nucleotide identity to a query copy. The clades to which elements were assigned were verified by phylogenetic analysis.

Penelope-like elements were also annotated using the above method on the basis of their RT sequences.

DNA transposons

hAT DNA transposon

Isolation and identification of hAT DNA transposons were done using a modified version of the procedure used by Arensburger et al. (2005). A list of known transposase amino acid sequences belonging to the hAT superfamily was compared to the initial assembly of the Ae. aegypti genome available at NCBI using the TBLASTN program. The resulting nucleotide sequences, together with adjacent base pairs, were examined using custom-written Perl scripts for sequences with the following characteristics: 1) presence of 8 bp target site duplications, and 2) presence of terminal inverted repeats. These sequences were then examined for open reading frames using the program GENSCAN, and the predicted open reading frames were compared to known hAT superfamily transposases using the BLASTP program. Elements with less than 90% identity in the translated open reading frames and with different terminal inverted repeats were considered separate elements.
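
The two structural checks described above (8 bp target site duplications and terminal inverted repeats) can be illustrated with a short sketch. The original work used custom Perl scripts that are not reproduced here; the Python below is only an assumed, simplified equivalent. It detects perfect TIRs only, and the min_tir threshold of 10 bp is a hypothetical parameter that is not taken from the text.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_tsd(left_flank: str, right_flank: str, tsd_len: int = 8) -> bool:
    """True if the bases just before the element repeat just after it."""
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


def tir_length(element: str) -> int:
    """Length of the perfect terminal inverted repeat of an element.

    The 5' end is compared base by base with the reverse complement of
    the 3' end; counting stops at the first mismatch.
    """
    elem = element.upper()
    rc = reverse_complement(elem)
    n = 0
    for a, b in zip(elem[: len(elem) // 2], rc):
        if a != b:
            break
        n += 1
    return n


def looks_like_hat_candidate(left_flank: str, element: str, right_flank: str,
                             min_tir: int = 10) -> bool:
    """Combine both checks; min_tir is an illustrative threshold."""
    return has_tsd(left_flank, right_flank) and tir_length(element) >= min_tir
```

Real TIRs are often imperfect, so a practical script would likely tolerate mismatches rather than stop at the first one.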

Other DNA transposons

The following method was used to annotate one family of DNA transposons at a time.

1) Generate a sub-database that contains all copies of one family of DNA transposons (e.g., DD37D, DD41D, Tc1, PIF, etc.). This was achieved by tblastn against the Ae. aegypti assembly, using as queries peptide sequences of representative DNA transposons of the same family (parameters: -e 1e-5 -b 10000 -v 10000 -F F). The blast output was processed using TEpost and TEcombine (Biedler and Tu, 2003). Blast hits plus flanking sequences were retrieved to generate this sub-database.

2) Identify unique representatives of full-length copies in the above sub-database by running FINDMITE (Tu, 2001) and aligning similar copies using ClustalW.

3) Run RepeatMasker with the elements identified in step 2 to mask the sub-database (always with -div 25 unless otherwise noted, using the cross_match search engine). Cycle back to step 2 until no more full-length copies are found.

4) Go back to step 2 and use ClustalW alone to classify truncated elements within the same family.

5) Use blast to check whether there are redundant elements among the elements annotated using the above methods. The criterion for redundancy is at least 75% identity at the nucleotide level.
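
The cycle between steps 2 and 3 is a discover-then-mask loop that stops when a round finds nothing new. The sketch below shows only that control flow in Python; find_full_length and mask are hypothetical callables standing in for the FINDMITE/ClustalW and RepeatMasker steps, and max_rounds is an assumed safety limit that does not come from the text.

```python
from typing import Callable, List


def iterative_annotation(
    sub_db: List[str],
    find_full_length: Callable[[List[str]], List[str]],
    mask: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 50,
) -> List[str]:
    """Alternate discovery and masking until no new full-length copies appear.

    sub_db           -- sequences of the family sub-database (step 1)
    find_full_length -- stand-in for step 2; returns newly found elements
    mask             -- stand-in for step 3; returns the masked sub-database
    """
    elements: List[str] = []
    for _ in range(max_rounds):
        new_elements = find_full_length(sub_db)
        if not new_elements:  # nothing left to discover: stop cycling
            break
        elements.extend(new_elements)
        sub_db = mask(sub_db, new_elements)
    return elements
```

Passing the two steps in as functions keeps the loop independent of any particular tool, which is a design choice of this sketch rather than a description of the original scripts.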

Helitron

The method was the same as above, except that FINDMITE was not run.

MITEs

A three-step approach was used.

1) Initially, FINDMITE (Tu, 2001) was run and sequences with obvious characteristics (TIRs, TSDs, short length) were annotated as MITEs.

2) Additional MITEs were identified among the RepeatScout (Price et al., 2005; k-mer seed of length 13) output from Brian Haas (TIGR). Candidate MITE sequences (those having TIRs) were identified manually and used as queries to retrieve related copies for Clustal alignment. If the alignment was consistent with the existence of TIRs (determined by manual inspection, self-blast, and FINDMITE) and TSDs, and the MITE candidate did not have any coding potential, the candidates were classified as different MITEs according to their TSDs. Elements with subterminal inverted repeats and atypical TSDs are grouped as otherMITEs.

3) Use blast to check whether there are redundant elements among the elements annotated using the above methods. The criterion for redundancy is at least 75% identity at the nucleotide level.

Estimating TE copy number and genome occupancy

TE occupancy (% of the genome) and TE copy number were estimated using RepeatMasker (version 3.1.5) with WU-BLAST as the search engine under default settings. To filter out short sequences that inflate the copy number, we used the following criteria. If the TE query is 500 bp or less, the total length of the copy assigned by RepeatMasker has to be greater than 100 bp to be counted. If the TE query is longer than 500 bp, the length of the copy has to be greater than 20% of the length of the TE query to be counted. This filter was not applied when estimating TE occupancy. The presence of “composite” TEs, which contain smaller “simple” TEs, could complicate the assignment of genomic sequences to a particular TE. To address this problem, known “composite” TEs were masked after the genome was masked with the smaller “simple” TEs.
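
The copy-number filter described above can be written as a simple rule. The following Python is a minimal sketch, assuming that the total length of each copy has already been extracted from the RepeatMasker output; the function names are illustrative.

```python
from typing import Iterable


def counts_as_copy(query_len: int, copy_len: int) -> bool:
    """Apply the length filter used when counting TE copies.

    Queries of 500 bp or less: the copy must be longer than 100 bp.
    Longer queries: the copy must be longer than 20% of the query.
    """
    if query_len <= 500:
        return copy_len > 100
    # copy_len > 0.20 * query_len, written with integers to avoid rounding issues
    return 5 * copy_len > query_len


def copy_number(query_len: int, copy_lengths: Iterable[int]) -> int:
    """Count the copies of one TE query that pass the length filter."""
    return sum(counts_as_copy(query_len, n) for n in copy_lengths)
```

For a 1000 bp query with copies of 150, 250, and 900 bp, only the last two exceed the 200 bp cutoff, so the copy number under this filter is 2. Occupancy would be computed from all masked bases without this filter.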

REFERENCES

Z. Tu, Proc Natl Acad Sci U S A 98, 1699 (Feb 13, 2001).

J. Biedler, Z. Tu, Mol Biol Evol 20, 1811 (Nov, 2003).

E. M. McCarthy, J. F. McDonald, Bioinformatics 19, 362 (Feb 12, 2003).

P. Arensburger et al., Genetics 169, 697 (Feb, 2005).

J. M. Tubio, H. Naveira, J. Costas, Mol Biol Evol 22, 29 (Jan, 2005).

A. L. Price, N. C. Jones, P. A. Pevzner, Bioinformatics 21 Suppl 1, i351 (Jun, 2005).

A. F. A. Smit, R. Hubley, P. Green,