Supplementary Methods

1. Sequencing and assembly

1.1 Cloning

1.2 Sequencing

1.3 Assembly

2. Physical Mapping

2.1 Hybridisations

2.2 Fingerprints

2.3 Fluorescent in situ Hybridisations (FISH)

2.4 Flow Cytometry

3. Ultracontigs, assembly validation and genome coverage

4. Transposable elements (TEs)

5. Tetraodon Gene Annotation

5.1 Repeat Masking

5.2 Exofish between mammal genomes and Tetraodon

5.3 Exofish between the Takifugu genome and Tetraodon

5.4 Genewise

5.5 cDNAs

5.6 Geneid and Genscan

5.7 Integration of resources using GAZE

6. Analysis of specific gene families

6.1 Class I cytokines and their receptors

6.2 Selenoproteins

6.3 HOX genes

7. Estimate of gene numbers and new human genes

8. Genome evolution

8.1 Identification of vertebrate orthologs

8.2 Neutral rate of DNA evolution

8.3 Rate of evolution in coding DNA

8.4 Genome duplication

8.5 Synteny

8.6 Pairing duplicate chromosomes by orthology with the human genome

9. Protein domain analysis

10. References

1. Sequencing and assembly

1.1 Cloning

Randomly sheared and size-selected genomic DNA isolated from three individuals was used to prepare plasmid libraries with average insert sizes of 4 kb (Broad Institute, BI; one individual) and of 2 kb, 2.5 kb and 4 kb (Genoscope, GSC; one individual), and a BAC library with an average insert size of 136 kb (GSC, one individual). After sequencing and quality checks, 93% of the sequences were paired, that is, derived from opposite ends of the same clone (Table SI1).

1.2 Sequencing

Sequencing was performed as described previously for Genoscope4 and the Broad Institute5,6. Approximately 4.25 million reads passed extensive checks for quality and source, representing approximately 8.3-fold sequence coverage of the Tetraodon genome.

1.3 Assembly

Arachne is a software package developed at BI that has been adapted for the assembly of large (mammalian-size) genomes, such as the mouse genome5. Arachne is sensitive to polymorphic bases in sequence reads, because they can be interpreted as sequencing errors or as sequence duplications. Here, sequence reads were obtained from three individuals (GSC-plasmid 2.4 million reads; BI-plasmid 1.8 million reads; GSC-BAC 0.05 million reads), which increases the likelihood of polymorphic bases. Different strategies were tested to assemble the Tetraodon genome. First, an assembly with all 4.25 million reads combined was attempted but, because of the high rate of polymorphic bases, it produced too many redundant contigs to proceed with scaffolding. Then each dataset (GSC and BI, each originating from a different fish) was assembled separately, and each assembly produced a much better result than the combined dataset. Despite its lower coverage, the assembly of the BI reads alone was better than the assembly of the GSC reads alone, perhaps because of differences in polymorphism or because the assembly code handled lower coverage by polymorphic reads better. For this reason we used the BI reads as the basis for the rest of the assembly, and added GSC reads to increase the coverage. This assembly has an N50 contig size of 18.4 kb and an N50 supercontig size of 635 kb.
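For readers unfamiliar with the N50 statistic quoted above, the following minimal Python sketch shows how it is conventionally computed from a list of contig or supercontig lengths. The function name and the toy lengths are illustrative assumptions; this is not code from Arachne or from the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in kb (invented for illustration, not assembly data).
toy_contigs = [2, 3, 4, 5, 6, 10, 20]
print(n50(toy_contigs))
```

With these toy lengths the total is 50 kb, and the two largest contigs (20 kb and 10 kb) already hold more than half of it, so the N50 is 10 kb.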

A separate assembly was performed at Genoscope with sequence reads from both centres. In brief, the combined reads from GSC and BI were first quality clipped and masked for repeats, then “single-linkage” clustered based on a minimum overlap of 150 bases with 97% identity. This step was performed with the lspmul algorithm7 implemented in the Biofacet package (http://www.gene-it.com). Each cluster was then assembled separately using Phrap (P. Green, unpublished) to produce consensus sequences (contigs). Paired-end information was then used to link and orientate contigs. When the linking information suggested an overlap between contigs, their corresponding clusters were merged and re-assembled with Phrap. This procedure was repeated until no new contigs could be generated. This assembly produced 103,412 contigs covering 358 Mb, with an N50 value of 8.7 kb. Connecting contigs using read pairs produced 6,592 scaffolds (325 Mb including 67 Mb of gaps) with an N50 length of 222 kb.
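The single-linkage clustering step can be illustrated with a short union-find sketch in Python. It assumes that pairwise overlaps have already been computed and are supplied as (read_a, read_b, overlap_length, identity) tuples; the function name and the toy data are hypothetical, and this is not the lspmul algorithm or the Biofacet implementation. Only the two thresholds (150 bases, 97% identity) come from the text.

```python
MIN_OVERLAP = 150     # bases, threshold stated in the text
MIN_IDENTITY = 0.97   # fraction, threshold stated in the text


def single_linkage_clusters(read_ids, overlaps):
    """Group reads so that any two reads joined by a chain of
    qualifying overlaps end up in the same cluster."""
    parent = {r: r for r in read_ids}

    def find(r):
        # Follow parent pointers to the cluster root (with path halving).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b, length, identity in overlaps:
        if length >= MIN_OVERLAP and identity >= MIN_IDENTITY:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for r in read_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(members) for members in clusters.values())


# Toy overlaps (invented): r3-r4 is too short, r4-r5 is too divergent.
toy_reads = ["r1", "r2", "r3", "r4", "r5"]
toy_overlaps = [
    ("r1", "r2", 200, 0.99),
    ("r2", "r3", 150, 0.97),
    ("r3", "r4", 149, 0.99),
    ("r4", "r5", 300, 0.96),
]
print(single_linkage_clusters(toy_reads, toy_overlaps))
```

In this toy case r1 and r3 share a cluster even though they do not overlap directly, which is the defining property of single linkage; each resulting cluster would then be passed to the assembler separately.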

When these contigs were compared with those produced by Arachne using BLASTN, about 10% of the contigs were found to be unique to the Genoscope assembly, although the same set of reads was used as input. In addition, when all Arachne contigs were compared with each other, about 10% of the sequence was redundant, with small contigs (less than 5 kb) generally included in large ones (more than 50 kb). We therefore decided to remove the ~10% redundant small contigs from the Arachne assembly, and add the ~10% new sequence assembled at Genoscope. The composite assembly was the basis for the analysis described in this article.

2. Physical Mapping

When the Tetraodon sequencing project was initiated in 1997, only 25% of the human genome was sequenced (in draft form) and no large genome had ever been assembled using the “Whole Genome Shotgun” approach. It was unclear at the time whether this approach would succeed without the assistance of long-range scaffolding information such as could be provided by physical or genetic maps8,9. Given this uncertainty, and because Tetraodon cannot yet be bred in captivity to generate a genetic map, we reasoned that a physical map might be necessary to assist the assembly of the genome sequence. We pursued three strategies: hybridisation on clone libraries, restriction digest fingerprints and in situ hybridisation on metaphase chromosomes.

2.1 Hybridisations

The pilot phase of the Tetraodon sequencing project initially produced about 50,000 BAC end sequences from two BAC libraries, A and B, that together represent about 14-fold coverage of the genome10. We robotically arrayed 55,000 BAC clones (10-fold coverage) on nylon membranes as previously described11,12. We generated hybridisation probes by designing PCR primers from BAC end sequences and amplifying from total genomic DNA. After validation by gel electrophoresis, PCR products were labelled by incorporation of digoxigenin (DIG, Roche Molecular Diagnostics) during the amplification reaction. Hybridisations were performed as described previously12. We routinely hybridised between 24 and 48 probes per day, for a total of about 3,000. Images were captured under long-wave UV light as TIFF images and positive signals were manually scored using the Xdigitise software (Huw Griffith, Hans Lehrach, personal communication). Probes that hybridised to more than 35 clones were considered non-specific and were excluded from further analysis. We later changed strategy and switched to a BAC fingerprinting method. By then, 2,308 single-copy probes (i.e. probes that hit between 1 and 35 clones) had been hybridised successfully to about 60% of the library. A first set of 901 contigs was built from these data using the probeorder software13.
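The probe filtering rule described above can be sketched as follows in Python. The sketch assumes that the manual scores are available as a mapping from probe name to the set of positive clones; the function names, the "no_signal" label for probes without any hit, and the toy data are illustrative assumptions, and the code is unrelated to Xdigitise or probeorder. Only the threshold of 35 clones comes from the text.

```python
MAX_POSITIVE_CLONES = 35  # threshold stated in the text


def classify_probe(positive_clones):
    """Classify one probe from the set of clones it hybridised to."""
    n = len(positive_clones)
    if n == 0:
        return "no_signal"      # assumption: label for probes without any hit
    if n <= MAX_POSITIVE_CLONES:
        return "single_copy"    # between 1 and 35 positive clones
    return "non_specific"       # more than 35 positive clones


def clone_to_probes(scores):
    """Invert probe -> clones into clone -> probes, keeping only
    single-copy probes."""
    index = {}
    for probe, clones in scores.items():
        if classify_probe(clones) == "single_copy":
            for clone in clones:
                index.setdefault(clone, set()).add(probe)
    return index


# Toy scores (invented): p3 hits 40 clones and is discarded, p4 has no hit.
toy_scores = {
    "p1": {"A01", "A02"},
    "p2": {"A02", "B07"},
    "p3": {f"C{i:02d}" for i in range(40)},
    "p4": set(),
}
print(clone_to_probes(toy_scores))
```

Clones that share a retained probe (here A01 and A02 through p1, A02 and B07 through p2) are candidates for overlap, which is the kind of information a contig-building program can then order.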

2.2 Fingerprints

To supplement the hybridisation data and to increase the resolution of the physical map, we systematically fingerprinted both BAC clone libraries. Plasmid DNA was digested with EcoRI, run on 1% agarose gels, stained with SYBR Green and captured as TIFF images. Individual restriction bands from the digests were manually scored using the Image software, and comparisons between 32,817 BAC restriction profiles were performed with FPC as described14. A cutoff score of 10^-9 and a variable tolerance of 7 were used. This produced 3,354 contigs; the largest were manually inspected and broken or fused when necessary, resulting in 2,659 contigs. Data from the hybridisations were incorporated as markers (probes) associated with BAC clones, which facilitated the inspection of FPC contigs. Still, because EcoRI digests of these BAC clones generated too few visible bands (about 15 on average), a large fraction of clones (12,096 of the 32,817 clones that were successfully fingerprinted) remained as singletons.

2.3 Fluorescent in situ Hybridisations (FISH)

To obtain a better overview of the organisation of the pufferfish genome at the chromosomal level, we obtained the karyotype15, representative idiograms of the chromosomes and FISH patterns of various repetitive and single-copy probes in metaphase plates. Single and double FISH of BACs were performed in high-stringency conditions, in the presence of competitor (Tetraodon) and carrier (bovine) sonicated DNA, after a pre-hybridisation step to re-anneal repeated sequences. For probes inserted in plasmids, the vectors were first hybridised alone to check for any potential contaminating signal. TIFF images were captured and analysed using the GENUS animal karyotyping and FISH-imaging software (Applied Imaging). In particular, the multiple locations of most transposable elements were examined by hybridising them separately and also in pairs, to identify areas where they accumulate and their location relative to each other. Repetitive sequences such as rDNA gene clusters, centromeric and subtelocentric satellite sequences and telomeric repeats10,15 have previously been physically mapped on the chromosomes. Clusters of genes implicated in the immune system (Ig, MHC, etc.) were also precisely located. To validate the assembly and to map or orientate ultracontigs on their chromosome of origin, 392 double-FISH experiments were performed using combinations of 117 biotin- or digoxigenin-labelled BAC clones. In addition, 99 double-colour FISH experiments were performed with 22 transposable element probes to assign them to specific heterochromatic blocks16.

2.4 Flow Cytometry

The three Tetraodon specimens were purchased in France from an aquarium fish retailer. Blood was extracted and resuspended in anticlotting solution containing 1.8 mg/ml Pefabloc SC (Merck). The suspension was centrifuged at 2,000 rpm for 2-3 min and the pellet washed three times in anticlotting solution to remove lysed cells. Before freezing at -80°C, the suspension was supplemented with 10% DMSO. Blood samples from the three Takifugu specimens were kind gifts from Prof. Toshiaki Itami (Miyazaki University, Japan) and Prof. Shugo Watabe (Tokyo University, Japan). Flow cytometry experiments were performed as described previously17.

Surprisingly, intraspecific differences are substantial, which may be due to differences in sex, age or rearing conditions17. Intraspecific variations such as those noted here for the two pufferfish (Figure S3) have been reported before in the fly Drosophila melanogaster, where environmental conditions also had an effect17. This implies that genome size measurements must always be compared not only against the same standard but also within the same experiment. Because of these variations, no rigorous absolute value can be given here for the Tetraodon and Takifugu genome sizes. In the vast majority of cases, however, the estimates described here indicate that the Tetraodon genome is smaller than the Takifugu genome, and this is reflected in the average values for each genome given as reference in the main text.

3. Ultracontigs, assembly validation and genome coverage

To build ultracontigs, three types of information were used. First, we compared the scaffolds to all BAC and plasmid end sequences that did not participate in the assembly. These sequences were excluded from the initial build for two main reasons: either they were not of sufficient quality, or they were eliminated because they contained highly repetitive sequences. Sequences of insufficient quality may still provide useful linking information if additional constraints are applied: they must match a single position in the assembly, and the distance between the two ends of a clone must agree with the insert size range of the clone library. The most interesting clones are those whose two ends are located in different scaffolds. In this case, the sum of the distances from each end sequence to the end of its scaffold must be within the range of the library insert sizes. We retrieved 2,576 linking BAC clones and 3,065 linking plasmid clones using this approach.
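The distance constraint on clones whose ends fall in different scaffolds can be sketched in Python as follows. The data layout, the strand convention, the function names and the insert size bounds in the example are all assumptions made for illustration; the sketch does not reproduce the programs used at Genoscope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPlacement:
    scaffold: str
    position: int  # coordinate of the end sequence on its scaffold
    strand: str    # assumed convention: '+' points to the scaffold's
                   # right end, '-' points to its left end


def distance_to_scaffold_end(end, scaffold_lengths):
    """Distance from an end sequence to the scaffold end it points toward."""
    if end.strand == "+":
        return scaffold_lengths[end.scaffold] - end.position
    return end.position


def is_valid_link(end_a, end_b, scaffold_lengths, min_insert, max_insert):
    """Accept a clone as an inter-scaffold link when the summed distances
    to the scaffold ends fall within the library insert size range."""
    if end_a.scaffold == end_b.scaffold:
        return False  # both ends in one scaffold: not a linking clone
    span = (distance_to_scaffold_end(end_a, scaffold_lengths)
            + distance_to_scaffold_end(end_b, scaffold_lengths))
    return min_insert <= span <= max_insert


# Toy example with hypothetical scaffolds and a hypothetical BAC size range.
lengths = {"s1": 500_000, "s2": 300_000}
end_a = EndPlacement("s1", 440_000, "+")   # 60 kb from the right end of s1
end_b = EndPlacement("s2", 70_000, "-")    # 70 kb from the left end of s2
print(is_valid_link(end_a, end_b, lengths, 100_000, 170_000))
```

Here the two distances sum to 130 kb, which lies inside the assumed 100 to 170 kb range, so the clone would be retained as a link between s1 and s2. The requirement that each end sequence matches a single position in the assembly would be checked before this step.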

The second type of information was extracted from a global alignment with the Takifugu genome assembly. This alignment was performed initially to identify evolutionarily conserved regions (ecores) to assist the annotation of Tetraodon protein-coding genes (see below). High-scoring segment pairs (HSPs) that are contiguous between genomes denote regions of synteny. In many cases such runs of collinear HSPs are interrupted by the end of a scaffold in Tetraodon and start again on a new scaffold. In such cases, the Takifugu scaffold effectively links the two Tetraodon scaffolds. We imposed two simple constraints to eliminate potential wrong links due to a loss of synteny within a gap in the Tetraodon assembly, by ensuring that the sum of the distances between the ends of the Tetraodon scaffolds and the two HSPs closest to those ends is compatible with the inter-HSP distance in Takifugu. In addition, in 370 cases the scaffolds linked by Takifugu also contained a new BAC and/or plasmid link, and the two types of information always agreed. This provided a firm basis for using Takifugu links alone to associate and orient two Tetraodon scaffolds. In total, 4,685 links were provided by the alignment with Takifugu.

All new links were used to build ultracontigs automatically using Cover and Coverparse, two programs written at Genoscope and routinely used to assemble sequence contigs in BAC sequencing projects. Ultracontigs incorporated 2,962 new links (47.3% Takifugu alone; 40.2% clones alone; 27.3% Takifugu and clone). These data were then transferred to an Acedb database, and the 128 largest ultracontigs were manually examined for internal consistency and retained for FISH mapping on Tetraodon chromosomes (see above). Probes for FISH were BAC clones selected at the ends of ultracontigs and were thus required to hybridise on the same chromosome arm. A side effect of the FISH mapping is thus to validate all inter-scaffold links lying between the two probes, since a wrong link would have a much higher probability of associating scaffolds from different chromosomes than from the same chromosome. In all cases the two probes did hybridise on the same chromosome arm, validating in particular 216 links created by Takifugu alone, which represent a subset of the 1,401 links created by this approach.

To estimate the actual proportion of euchromatin present in the assembly, we sequenced 1,472 new reads from clones that did not participate in the assembly but originated from one of the two shotgun plasmid libraries. The initial intention was to align these reads to the assembly and derive an estimate of genome coverage from the percentage of aligned reads. However, such an estimate turned out to be highly sensitive to the quality of the reads, the alignment strategy and the alignment parameters. To reduce the influence of these variables, we used the following approach: