Supplementary Methods

Derrien et al

The GENCODE v7 catalogue of human long non-coding RNAs: Analysis of their gene structure, evolution and expression.

LncRNA 5’ end CAGE support

To find transcript 5’ end CAGE support, we used CAGE clusters made from PolyA+ whole cell CAGE experiments in 12 different cell lines, that were further predicted as TSS and highly reproducible across bioreplicates (ENCODE CODE: NCP005). An lncRNA transcript having a CAGE cluster within 50bp of its TSS defined to have 5’ end CAGE support – resulting in 2,156 transcripts (15%). The corresponding number for protein-coding transcripts was 38,831 (55%). To control for possible influence of absolute transcript expression level, we divided the lncRNA and the protein-coding transcripts into log10(RPKM) expression bin classes, calculated as mean RPKM across 24 ENCODE CAGE experiments.

LncRNA 5’and 3’ end PET support

101 bp windows around lncRNA 5’ and 3’ ends were searched for stranded inclusion of PET 5’ and 3’ ends (respectively) from 16 different PET experiments (ENCODE CODE: NCP005). When a hit was found on the 5’ (or 3’) side, the lncRNA was considered to have 5’ (or 3’) end PET support. A total of 1,503 lncRNA transcripts were found to have both 5’ and 3’ end PET support (10%), and as did 26,679 protein-coding transcripts (39%). As discussed above, to control for possible influence of absolute transcript expression level, we divided the lncRNA and the protein-coding transcripts into log10(RPKM) expression bins. In order to look for PolyA motifs, we extracted 101 bp around the 3’ end of three transcript sets (14,880 lncRNAs, ~30,000 mRNAs and 14,880 random lncRNAs which are lncRNAs remapped randomly in the genome). In this region, we searched for the six known polyadenylation signals: AATAAA, ATTAAA, TATAAA, AGTAAA, AAGAAA and AATATA. For each 3’ transcript sets, we calculated the proportions of transcripts exhibiting at least one of these motifs.

LncRNA genes connected to neighbouring protein-coding genes by Paired-end RNAseq data

Two GENCODE genes g1 and g2 are considered as connected by Paired-end (PE) RNAseq if there is at least one ENCODE PE stranded experiment (K562, GM12878, H1hESC, polyA+ or polyA-) where two mates of a PE read, m1 and m2, 1) map uniquely on the genome and 2) are included in an exon of g1 and g2 respectively, on the correct strand. Also, we consider a protein-coding gene to be the neighbor of a lncRNA gene if there is no other protein-coding or lncRNA gene between them. Thus we find 449 lncRNA genes connected to neighbouring protein-coding genes (9% of 5,017 lncRNA genes with neighbouring protein-coding genes), and a total of 3,729 connected neighbouring protein-coding genes (17% of 21,366 ).

phastCons scores for lncRNAs and mRNAs

For each lncRNA or protein-coding transcript, the exons, introns and promoters were compared with the vertebrate phastCons elements 46-way table (Fujita et al. 2011). We considered as promoters the regions 1kb upstream the transcription start site (TSS). The set of ancestral repeats (AR) included 1,309 repeat elements (i) conserved in dog, mouse and human, (ii) that do not overlap any functional elements and which mapped within 10kb from lncRNAs, in order to avoid possible biased gene conversion (BGC) artifacts (Ponjavic et al. 2007; Galtier and Duret 2007).

Cross-species lncRNAs screening pipeline

The screening pipeline considered a set of 18 mammalian genomes (cat, chimp, cow, dog, elephant, guinea pig, horse, human, macaque, marmoset, mouse, opossum, orangutan, panda, pig, platypus, rabbit, rat) downloaded from UCSC and Ensembl repositories (Fujita et al. 2011; Flicek et al. 2011). The following assemblies used were: felCat4, panTro2, bosTau4, canFam2, loxAfr3, cavPor3, equCab2, hg19, mmul1.0, calJac3, mm9, monDom5, ppyg2, ailMel1, susScr2, ornAna1, oryCun2, rn4. The screening comprised three main steps. The first consisted in searching each query against the target genomes with a version of BlastN optimized for ncRNA discovery (Freyhult et al. 2007). Secondly, using the Exonerate program (Slater and Birney 2005) each query was realigned against the genomic neighborhood identified by Blast. For each query and for each genome, the best hit was kept, as reported by the Blast E-value, that was successfully remapped by exonerate. The alignments spanning at least 70% of the human queries were retained. Finally each query was compared to all the putative discovered homologs by realigning the transcripts sequences with T-Coffee (Notredame et al. 2000) and measuring the query/homolog pairwise similarity. The values calculated in this way were used to generate the heatmap in Figure 4.

LncRNA Family Clustering

The set of 14,880 GENCODE lncRNAs was clustered using blastclust (version 2.2.25), a component of the NCBI-Blast package (Altschul et al. 1990). The sequences were clustered considering an inclusion E-value threshold of 0.00001, a minimum percent of identical residues of 40 and minimum length coverage of 40%. Two clustering strategies were pursued. One considering the straight blastclust on the sequences, and the other by applying ancestral repeats and low complexity filtering. The arising clusters were aligned by default TCoffee and average pairwise similarity and compensatory mutations were estimated (Notredame et al. 2000).

Chromatin Signatures

Histone modifications were downloaded from UCSC genome browser on August 15th 2010. ChIPseq reads for each data set were extended to the full fragment length (300bps for histone modifications and 225bps for Pol2). For each position in the genome we counted the number of extended reads overlapping this position. We refer to this number of the “count of the position”. We then aligned counts with respect to TSS of GENCODE v7 transcripts and averaged each position over all TSS of the set. Finally, we represented these aggregated plots for different lncRNAs and mRNAs transcripts that were considered as transcribed (RPKM>1) in the same cell line where the ChIPseq experiment was conducted.

Microarray Analysis

Custom lncRNA microarrays were designed using the Agilent eArray system and printed in the 8 x 60k format, with probe length of 60 nt. Starting with 9929 GENCODE3c transcripts, we managed to design suitable microarray probes for 9747. For the majority of transcripts - 9731 - we designed 6 distinct probes. In addition, each array contained a set of 3231 randomly-selected probes from the Agilent catalogue targeting human protein-coding genes. Prior to hybridisation, RNA sample quality was checked by Bioanalyzer (Agilent), and only those with RNA Integrity Number (RIN) > 0.7 were used. RNA samples were obtained from the following sources: purified brain region RNA were from postmortem brain of the same healthy 52-year old male, with a 3 hour postmortem delay; other human tissue RNAs were purchased from Ambion and came from various different subjects (FirstChoice panel, Cat. AM6000); cell line RNAs were extracted from cultured cells. 100ng of total RNA was oligo-dT reverse transcribed and labelled with Cy3 using the Low Input QuickAmp kit (Agilent) in single-colour format according to manufacturers protocol. Complementary RNA was hybridised to arrays according to manufacturers protocol, and scanned on an Agilent G2565CA microarray scanner at 100% PMT and 3μm resolution. Intensity data was extracted using the Feature Extraction software (Agilent). Raw data is taken from the Feature Extraction output files and was corrected for background noise using the normexp method (Ritchie et al. 2007). To assure comparability across samples we use quantile normalization (Bolstad 2001. Probe Level Quantile Normalization of High Density Oligonucleotide Array Data. Unpublished manuscript http://bmbolstad.com/stuff/qnorm.pdf). Expression data was log2 transformed.

Probes were defined as “present” if they passed, for at least 50% of the samples, the “glsPosandSignif” flag provided by the Feature Extraction output files, with an intensity 1.5x calculated background (this threshold was increased to 2x in the case of “stringent” calling). A transcript was defined as “present” if at least 3 of its corresponding probes were present. In this case, the transcript expression intensity was calculated to be the median value of its present probes' intensities. All statistical analyses and filtering were performed with the Bioconductor project (http:// www.bioconductor.org/) in the R statistical environment (Gentleman et al. 2004). All expression data can be found in the Supplementary Data Files.

Correlation of Expression Analysis

Expression values (RPKMs) for all GENCODE v7 genes were computed according to the ENCODE pipeline (ENCODE CODE: NCP005) at the exon, transcript and gene levels and for the 16 tissues (adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid and whiteblood) of the Human Body Map data set (ArrayExpress ID: E-MTAB-513; http://www.ebi.ac.uk/ena/data/view/ERP000546). Then, we assigned a vector of expression to each GENCODE v7 lncRNA and protein-coding gene (the latter consisting of ~20,500 genes as defined previously). Between and within each set of lncRNAs and/or mRNAs, we constructed all pairwise combinations, assummarized in Supplementary Table S5. For each combination having at least a non null vector of expression (breadth of expression greater or equal to 1), we calculated non-parametric Spearman correlations between each pair of lncRNA-mRNA, lncRNA-lncRNA, mRNA-mRNA and lncRNA-mRNA_random. The latter corresponds to correlations where the vector of expression of the mRNA was randomly shuffled. It has to be noticed that Spearman correlations are less influenced by outliers values (which is typically the case given the depth of RNAseq experiments) and non-linearity, compared to Pearson correlations. Finally, for the four resulting data sets, we plotted the density of correlation coefficients rho (rs), using the “ggplot2” package in R.

Nuclear Enrichment Analysis

RPKM values were extracted from ENCODE RNAseq quantifications for all transcripts in the lncRNA set, or transcripts with “protein_coding” biotype (ENCODE CODE: NCP005). A simple ratio of RPKM in nucleus/cytoplasm was calculated for those ENCODE cell lines where both nuclear and cytoplasm RNAseq data were available, and in the single case (K562) where chromatin RNAseq data was available. A transcript was only included for a particular cell type, if its RNAseq quantification fell below an IDR (irreproducible discovery rate) of 0.1 in both cell compartments. Nuclear/cytoplasmic or chromatin/nuclear ratios were log10 transformed. The same steps were applied in calculating polyA+ / polyA- ratios for nuclear RNA.

Detection of Consistently Compartimentalized Genes

The statistical method proposed to detect specific genes between compartments using RNAseq experiments from the ENCODE consortium is an extension to the classical Poisson regression (Marioni et al. 2008) with one block effect, based on the negative binomial distribution.

By supposing that, for gene k, Yijkl is a random variable which represents the number of counts of the jth replicate (j=1,2,…J) from the ith compartment (i=1,2) taken from the lth cell line (l=1,2), the probability of observing any specific count y can be written as follows:

Where Γ is the gamma function, λijkl = E(Yijkl), and r is the dispersion parameter which allows the variance of Yijkl to exceed the mean of Yijkl and thus guarantee to each effect not to be underestimated. By fitting for each gene, the following model:

where τik is the compartment’s effect and βkl the cell line effect which can be seen as the block effect for gene k, we can test the hypothesis :

H0: τ1k = τ2k vs H1: τ1k ≠ τ2k

The specific differential tests have been made using likelihood ratio tests which consist in comparing the ratio of maximum log-likelihood estimates of λijk, taken in the null (numerator) and alternative (denominator) hypothesis to a χ2 distribution of one degree of freedom. Finally, the Benjamini-Hochberg (from FDR family) adjustment has been applied to the raw P-values associated with the compartment’s effect, and so controlled the risk of increase of the Type I error rate, generated by multiple testing.

Multidimensional Scaling Approach (MDS) Representation:

From a given distance matrix Y (n rows, n columns) corresponding here to the pairwise Spearman coefficients of correlation of variables from expression datasets, multidimensional scaling proposes a set of graphs which permit to represent the studying distances in specific spatial plans.

Starting with the works of Torgerson W.S and Gower J.C, the principle is to calculate the eigenvalues and eigenvectors of the matrix B= -0.5*J*Y2*J to determinate the best projection of the variables in a subspace of given dimension (where J = In-1/n, with In the unit matrix of size n).

Indeed, by keeping the m first eigenvalues, the matrixgives the coordinates of the genes in the optimal plan (where Lm is the diagonal matrix of the m eigenvalues, and Em the unit corresponding matrix)

Now taking x and y, two distinct genes from the starting dataset, we consider the two types of distance:

- Distance of correlation (Spearman) : d(x,y) = 1-cor(x,y)

- Distance of squared correlation (Spearman) : d(x,y) = 1-cor2(x,y)

And so take into account the two types of associated distances matrix. The comparison of the two graphs generated by these matrix with cmdscale function from the R software, give us complementary information about the structure of correlation of the original dataset.

1