Evolutionary Analyses of the Human Genome
Wen-Hsiung Li, Zhenglong Gu, Haidong Wang & Anton Nekrutenko
Ecology & Evolution, University of Chicago
1101 East 57th Street, Chicago, IL 60637
Phone: 773-702-3104
Fax: 773-702-9740
Online supplement: Methodology and Data
Number of repetitive elements.
The draft sequence of the human genome was downloaded from the UCSC web site; we used the July 17th freeze. RepeatMasker, a program for identifying repetitive elements in a DNA sequence, was kindly provided by Dr. Arian Smit. RepeatMasker requires a library containing nucleotide sequences of repetitive elements. We used the most current release of the library, downloaded from the Genetic Information Research Institute (www.girinst.org). The identification of repetitive sequences ("masking") was carried out on a 96-CPU SGI machine located at the Argonne National Laboratory (www.mcs.anl.gov). The resulting information about repetitive elements was imported into a MySQL (www.mysql.com) relational database. The MySQL database files can be obtained by request from the authors or from the following web site: http://pondside.uchicago.edu/~lilab/genome_suppl.html.
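As an illustration of how repeat counts can be tallied from RepeatMasker annotations, the following minimal Python sketch counts elements per class/family. It is not the procedure used for the analysis, which relied on a MySQL database. It assumes the standard whitespace-delimited RepeatMasker .out layout, in which the eleventh column holds the repeat class/family; the function name count_repeats_by_class and the example rows are hypothetical.

```python
from collections import Counter


def count_repeats_by_class(lines):
    """Count annotated repeats per class/family from RepeatMasker .out lines.

    Assumes the whitespace-delimited .out layout, where column 11
    (index 10) is the repeat class/family, e.g. "SINE/Alu".
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        counts[fields[10]] += 1
    return counts


# Made-up rows for illustration only (not real annotations).
example_lines = [
    "   SW  perc perc perc  query  position in query  matching  repeat  position in repeat",
    "score  div. del. ins.  sequence  begin  end  (left)  repeat  class/family  begin  end  (left)  ID",
    "",
    "  687  17.4  0.0  0.0  chr1  100  250  (1000)  C  L2     LINE/L2   (128)  3118  2984  1",
    " 2300  10.0  0.1  0.2  chr1  300  600  (700)   +  AluSx  SINE/Alu  1      300   (12)  2",
    " 2100  11.0  0.0  0.0  chr1  650  940  (400)   +  AluJb  SINE/Alu  1      290   (22)  3",
]

repeat_counts = count_repeats_by_class(example_lines)
for repeat_class, n in sorted(repeat_counts.items()):
    print(repeat_class, n)
```

The same tally could be obtained with a GROUP BY query once the annotations are loaded into a relational database.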
Repetitive Elements in Proteins.
A collection of confirmed and predicted human proteins was downloaded from ftp://ftp.sanger.ac.uk/pub/birney/humanproteome/. To ensure that this dataset did not contain duplicated or otherwise related sequences (e.g., isoforms), we used BLAST (Altschul et al. 1997) to compare the dataset against itself. We deleted all but one copy of entries that had significant matches (E-value < 1e-80) and >50% overlap (using gene coordinates downloaded from ftp://ftp.sanger.ac.uk/pub/birney/genome/genome_location_hs4.list.gz). The resulting dataset can be obtained by request from the authors or from the following web site: http://pondside.uchicago.edu/~lilab/genome_suppl.html.
To identify transposable elements, we compared each protein sequence to a database containing nucleotide sequences of human repetitive elements downloaded from the Genetic Information Research Institute (www.girinst.org). We used the program tblastn from the BLAST package. The output file can be obtained on request from the authors.
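The redundancy filter described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the script used for the analysis: overlap is taken here as the fraction of the shorter gene span covered by the shared interval, entries are retained greedily in input order, and the names overlap_fraction and remove_redundant, the tuple layouts, and the example data are all hypothetical.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter gene span shared by two (chrom, start, end) records."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b) + 1
    shorter = min(end_a - start_a, end_b - start_b) + 1
    return max(shared, 0) / shorter


def remove_redundant(names, hits, coords, max_evalue=1e-80, min_overlap=0.5):
    """Keep one entry from each group of redundant proteins.

    names  : protein identifiers in the order they should be considered
    hits   : (query, subject, evalue) tuples from the self-comparison
    coords : identifier -> (chromosome, start, end) gene coordinates
    """
    redundant_pairs = set()
    for query, subject, evalue in hits:
        if query == subject or evalue >= max_evalue:
            continue
        if query not in coords or subject not in coords:
            continue
        if overlap_fraction(coords[query], coords[subject]) > min_overlap:
            redundant_pairs.add(frozenset((query, subject)))

    kept = []
    for name in names:
        # Assumption: greedy choice, the first entry seen is the one retained.
        if not any(frozenset((name, k)) in redundant_pairs for k in kept):
            kept.append(name)
    return kept


# Made-up identifiers, coordinates, and E-values for illustration only.
names = ["P1", "P2", "P3", "P4"]
coords = {
    "P1": ("chr1", 1000, 5000),
    "P2": ("chr1", 1200, 4800),  # lies within P1, likely an isoform
    "P3": ("chr2", 1000, 5000),  # similar sequence on another chromosome
    "P4": ("chr1", 4900, 9000),  # barely overlaps P1
}
hits = [
    ("P1", "P1", 0.0),
    ("P1", "P2", 1e-120),
    ("P2", "P1", 1e-120),
    ("P1", "P3", 1e-100),
    ("P1", "P4", 1e-90),
]

kept = remove_redundant(names, hits, coords)
print(kept)
```

In this example only P2 is dropped: P3 and P4 match P1 significantly but do not overlap it by more than 50%, so they are treated as distinct genes.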
Domain Sharing And Conservation
Protein sequences from yeast, worm, fruit fly, and human were downloaded from the proteome web page at EBI. The original files can be obtained on request from the authors or from the following web site: http://pondside.uchicago.edu/~lilab/genome_suppl.html.
Protein domains in these sequences were annotated using the InterProScan tool. After duplicate records and nested domains were removed, two files were produced for each taxon: a *.comp file and a *.pos file. The *.comp file contains information about the domain types within a sequence, sorted numerically, and has two fields: [sequence name] and [domain list]. The *.pos file contains information about the exact order and number of domains in a sequence and has three fields: [sequence name] [domain list] [number of domain types]. These files can be obtained on request or downloaded from http://pondside.uchicago.edu/~lilab/genome_suppl.html.
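To make the two record layouts concrete, the following Python sketch builds one *.comp record and one *.pos record from an ordered list of domain accessions. Several details are assumptions rather than facts about the distributed files: fields are separated by tabs, domains within a list are separated by commas, the *.comp list holds each domain type once, and "sorted numerically" refers to the numeric part of InterPro accessions of the form IPRnnnnnn. The function names and the example sequence are hypothetical.

```python
def comp_record(name, domains):
    """Build a *.comp-style record: [sequence name] [domain list].

    Domain types are listed once each, sorted by the numeric part
    of the accession (assumed form: "IPR" followed by digits).
    """
    types = sorted(set(domains), key=lambda acc: int(acc[3:]))
    return f"{name}\t{','.join(types)}"


def pos_record(name, domains):
    """Build a *.pos-style record: [sequence name] [domain list] [number of domain types].

    Domains are kept in their order along the sequence, repeats included.
    """
    n_types = len(set(domains))
    return f"{name}\t{','.join(domains)}\t{n_types}"


# Made-up sequence with three domain occurrences of two types.
example_domains = ["IPR001452", "IPR000719", "IPR001452"]

comp_line = comp_record("seqA", example_domains)
pos_line = pos_record("seqA", example_domains)
print(comp_line)
print(pos_line)
```

Keeping the two records separate allows sequences to be compared either by domain content alone (*.comp) or by exact domain architecture (*.pos).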
Duplicate genes
We used two protein datasets: (1) the same dataset as above and (2) a dataset from which all proteins displaying highly significant hits to L1 reverse transcriptase were removed. Similarity searches were carried out using the FASTA package. FASTA was chosen over BLAST because of output-parsing considerations.