TAGS: a Tool for Gene Set Analysis of Expression Time Series

Ying Liu

MOE Key Laboratory of Bioinformatics and Bioinformatics Div, TNLIST
Department of Automation, Tsinghua University, Beijing 100084, China

Name

TAGS – Time-series Analysis for Gene Set

Description

TAGS is a tool for gene set analysis of expression time series. It incorporates existing knowledge and analyzes the dynamic properties of a group of genes. It can be used to discover expression regulatory relationships and to analyze sets of genes that have functional or structural associations.

System Requirements

This software has been tested on an Intel Core 2 CPU with 2 GB RAM under Windows XP/Vista. At least 500 MB of free hard disk space is required to install and run it. Perl must be installed; the current version of TAGS has been tested with ActivePerl.

Installation

Double click ‘TAGS.msi’ to install TAGS. Follow the instructions to complete the installation. The TAGS files will be extracted to the installation folder specified during installation.

Usage

Double click the desktop shortcut to start TAGS. In Windows Vista one must find the executable file ‘TS.exe’ in the installation directory and run it manually as administrator. Figure 1 shows the TAGS main window.

Figure 1 The main window of TAGS

· Loading files

There are five types of data files which can be loaded by TAGS (see Files and Directory Structure):

✓ Expression file

✓ Covariate file

✓ Rank file

✓ Result file

✓ External file

One can load these files through the Load menu.

· Discovery of regulatory relationships across a time course

Load the expression, covariate and external files, which contain the gene expression time series, the time points, and the regulator expression time series, respectively. Click Analysis->With External Data (Figure 2). The parameters and options are as follows:

▪ Permutation times: the number of gene-set permutations (default 10, intended for a trial run). We recommend at least 100 permutations for a real analysis. Type the number in the line edit.

▪ Gene set path: click browse… to specify the directory that contains the candidate target set files.

▪ Q value cutoff: a number between 0 and 1 used as the Q value cutoff when there is more than one candidate set (regulator). When only one set is tested, the Q value is equivalent to the commonly used P value. All sets with Q values less than or equal to the cutoff are reported by TAGS.

Click OK to run the process after specifying appropriate parameters and options, and the current parameters and options will be stored by TAGS and will appear as default next time. Click Cancel to return to the main window.

Figure 2 Analyses from external data

· Discovery of significant gene sets for expression time series

The gene set analysis can start either from an expression profile or from a predefined gene rank.

✓ Analysis from expression profile

TAGS can calculate significant gene sets using an expression file loaded in the last step. One can do the analysis through Analysis->From Expression Data (Figure 3). The options and parameters are as follows:

Figure 3 Analysis from expression profile

■ Ranking method: regression, variance or correlation can be used for gene ordering. If regression or correlation is used, a covariate file is needed. If variance is used, only gene-set permutation can be used for the significance evaluation, because permuting time points does not change a gene’s variance (see Permutation method).

◆ Basis for regression: the basis used for the regression analysis of a single time series. Both natural cubic spline and polynomial spline are provided; select one with the combo box. Cubic spline (default) is recommended because of its flexibility.

■ Permutation method: either time-point permutation or gene-set resampling can be used to evaluate the significance of enrichment scores.

■ Permutation times: the number of time-point permutations (default 10, intended for a trial run). We recommend at least 100 permutations for a real analysis. Type the number in the line edit.

■ Gene set path: use browse… to specify the directory that contains the candidate gene set files.

■ Q value cutoff: a number between 0 and 1 used as the Q value cutoff when there is more than one candidate set. When only one set is tested, the Q value is equivalent to the commonly used P value. All sets with Q values less than or equal to the cutoff are reported by TAGS.

■ Adjust rank: if selected, the q-value, variance or Pearson correlation coefficient is used for calculating the enrichment score.

◆ Tie: genes with similar q-values, variances or Pearson correlation coefficients (differing by less than Threshold) are considered tied.

◆ Weighting: the q-value, variance or Pearson correlation coefficient is used for weighting.

Click OK to run the process after specifying appropriate options and parameters; the current options and parameters will be stored by TAGS and will appear as defaults next time. Click Cancel to return to the main window. When the process runs, TAGS first calls an EDGE function to calculate a gene rank according to each gene’s differential expression. Next, the time points are permuted and the corresponding ranks are calculated with the same strategy. Finally, gene set analysis is performed to find the significant gene sets. The running time depends on the number of candidate sets and, more importantly, on the number of permutations.
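The last step of this pipeline can be illustrated with a small sketch. This manual does not specify the exact statistic TAGS uses, so the code below is only an assumption: an unweighted, GSEA-style running-sum enrichment score whose significance is estimated by gene-set resampling. The function names `enrichment_score` and `gene_set_permutation_p` are hypothetical, and the sketch omits normalization, Q values, ties and weighting.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum over the ranking."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a set member, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_permutation_p(ranked_genes, gene_set, n_perm=100, seed=0):
    """Empirical P value from random gene sets of the same size (two-sided)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy ranking with made-up gene names; g1 is the top-ranked gene.
ranked = [f"g{i}" for i in range(1, 21)]
top_set = ["g1", "g2", "g3", "g4"]
score, p_value = gene_set_permutation_p(ranked, top_set, n_perm=200)
```

The smallest P value this estimate can return is 1/(n_perm + 1), which is one reason why only 10 permutations is suitable for a trial run and at least 100 are recommended for a real analysis.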

A result dialog opens automatically when the calculation is finished (Figure 4). The Significant Gene Set(s) text browser shows the result: the gene sets ordered by their Q values, with each significant gene set (represented by its file name) listed together with its normalized enrichment score, P value and Q value. One can respecify the Q value cutoff according to the results in the Q value cutoff text edit and click Recalculate for a recalculation. New results will appear in the same dialog in the same format. To save the result, click Save Result to show the save dialog (Figure 5), specify the result file name and path, and click OK to save the result containing the significant sets (TAGS automatically records all the sets together with their information in rec.tmp in the installation path). Click Cancel to return to the result dialog. TAGS also automatically saves the leading-edge subset of each significant gene set to the ‘lead’ directory (see Files and Directory Structure). Click Done in the result dialog to return to the main window.

Figure 4 The result dialog

Figure 5 Save analysis result

✓ Analysis from gene rank

TAGS can run an analysis against an existing gene rank file, generated either by the user or by other software and loaded through the Load menu. One can access this function through Analysis->From Gene Rank (Figure 6). The parameters are as follows:

Figure 6 Analysis from gene rank

■ Permutation method: either gene-set resampling or time-point permutation can be used to evaluate the significance of enrichment scores.

◆ Permutation times: if gene-set permutation is used, the number of permutations must be specified.

◆ Permutation file path: if time-point permutation is used, a directory containing rank files generated by time-point permutation must be prepared. Because there is no expression file, TAGS cannot permute time points automatically to generate the corresponding gene lists. Users must prepare the gene ranks needed for the analysis and load them into TAGS.

■ Gene set path: use browse… to specify the directory that contains the candidate gene set files.

■ Q value cutoff: same as the corresponding parameter in Analysis from Expression Data. See Analysis from expression profile.

■ Adjust rank: same as the corresponding parameter in Analysis from Expression Data. See Analysis from expression profile.

Click OK to run the process after specifying appropriate parameters; the current parameters will be stored by TAGS and will appear as defaults next time. Click Cancel to return to the main window. The result is shown in the result dialog (see Analysis from expression profile).
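When time-point permutation is chosen for an analysis from a gene rank, the permuted ranks have to be produced outside TAGS. The following sketch shows one way this might be done for the correlation ranking method. It is an illustration under stated assumptions, not the procedure TAGS or EDGE uses: the helper names are hypothetical, genes are ordered by absolute Pearson correlation with the time points, and the rankings are kept in memory instead of being written out in the rank file format.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def rank_by_correlation(expression, time_points):
    """Order genes by decreasing absolute correlation with the time points."""
    scores = {g: abs(pearson(v, time_points)) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


def permuted_ranks(expression, time_points, n_perm, seed=0):
    """One gene ranking per random reassignment of time points to samples."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_perm):
        shuffled = list(time_points)
        rng.shuffle(shuffled)
        ranks.append(rank_by_correlation(expression, shuffled))
    return ranks


# Values taken from the example expression and covariate files in this manual.
expression = {
    "AA004795": [582.583, 933.728, 481.011, 572.583, 641.637],
    "AA010078": [77.757, 73.5, 122.316, 89.047, 106.645],
}
time_points = [26, 26, 27, 29, 30]
observed_rank = rank_by_correlation(expression, time_points)
null_ranks = permuted_ranks(expression, time_points, n_perm=10)
```

Each ranking in `null_ranks` would still have to be saved as a rank file (see Rank file) in the directory given as the permutation file path.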

· Image

TAGS can plot two kinds of images to illustrate the result.

✓ Heatmap for significant gene set

When the analysis is finished, or a result file is loaded (see Loading files), one can click Image->Heat Map for Significant Gene Set to show the heatmap dialog (Figure 7). All the significant gene sets will be listed in the Select a set combo box. Choose one and click OK to show the heatmap (Figure 8). Click Save Heatmap under the image to save the heatmap as a bmp file to a specified directory. Click Done to return.

Figure 7 Choosing a significant set to plot heatmap

Figure 8 Heatmap of a significant gene set

✓ Histogram of permuted NESs (normalized enrichment scores)

When an analysis is done (either from an expression profile or from a gene rank), click Image->Histogram of Permuted NESs to show a histogram of all the permuted NESs (Figure 9). Click Save Histogram to save the image as a bmp file. Click Done to return to the main window.

Figure 9 Histogram of permuted NESs

Files and Directory Structure

● File types

✓ Expression file

A tab delimited file containing the expression matrix. Each row represents a time series except the first row which contains a header and sample (time point) names for each column. The first column contains the gene or probe set names, which should be the same identifiers as those in the gene set files. The following is an example:

GeneName GSM27015 GSM27016 GSM27017 GSM27018 GSM27019

AA004795 582.583 933.728 481.011 572.583 641.637

AA010078 77.757 73.5 122.316 89.047 106.645

…

✓ Covariate file

A tab delimited file containing the time-point covariate of each sample. The first row contains a header and the sample (time point) names, and the second row specifies the corresponding time points. The following is an example:

Cov Name GSM27015 GSM27016 GSM27017 GSM27018 GSM27019

Age 26 26 27 29 30
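As an illustration of the two formats above, the following sketch parses an expression file and a covariate file and checks that they describe the same samples. It is not part of TAGS. The function names are hypothetical, and two points are assumptions this manual does not state: that "Cov Name" is a single header cell, and that both files list the samples in the same order.

```python
import csv
import io


def read_tab_table(text):
    """Parse tab-delimited text into (column names, {row name: [float, ...]})."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    samples = header[1:]
    table = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row) - 1} values, expected {len(samples)}"
            )
        table[row[0]] = [float(value) for value in row[1:]]
    return samples, table


def check_samples(expr_samples, cov_samples):
    """Require identical sample names in identical order (an assumption)."""
    if expr_samples != cov_samples:
        raise ValueError("sample names differ between expression and covariate files")


# The example rows from this manual, with explicit tab separators.
expression_text = (
    "GeneName\tGSM27015\tGSM27016\tGSM27017\tGSM27018\tGSM27019\n"
    "AA004795\t582.583\t933.728\t481.011\t572.583\t641.637\n"
    "AA010078\t77.757\t73.5\t122.316\t89.047\t106.645\n"
)
covariate_text = (
    "Cov Name\tGSM27015\tGSM27016\tGSM27017\tGSM27018\tGSM27019\n"
    "Age\t26\t26\t27\t29\t30\n"
)

expr_samples, expression = read_tab_table(expression_text)
cov_samples, covariate = read_tab_table(covariate_text)
check_samples(expr_samples, cov_samples)
time_points = covariate["Age"]
```

Running such a check before loading the files can catch a missing value or a mismatched sample name early.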

✓ Rank file

A tab delimited file containing a gene rank, with or without a header row. The first column is the order (1, 2, 3, …). The second column contains the gene or probe set names enclosed in single quotes ('), which should be the same identifiers as those in the gene set files. The third column contains p-values. The fourth column contains q-values, which may be used for weighting. In particular, EDGE output can be used directly for the analysis. Here is an example:

Rank Gene Name P-Value Q-Value

1 'VDAC1P' 7.963686e-07 0.0002886315

2 'RAP2A' 7.963686e-07 0.0002886315

…
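A rank file in this layout could be read with a few lines of code, for example to inspect a user-generated rank before loading it. The sketch below is illustrative only. The function name `parse_rank_lines` is hypothetical, and detecting the optional header row by a non-integer first column is an assumption.

```python
def parse_rank_lines(lines):
    """Return a list of (order, gene, p_value, q_value) tuples from rank-file lines."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        try:
            order = int(fields[0])
        except ValueError:
            continue  # header row: its first column is not an integer
        gene = fields[1].strip().strip("'")  # drop the surrounding single quotes
        records.append((order, gene, float(fields[2]), float(fields[3])))
    return records


# The example rows from this manual, with explicit tab separators.
example = [
    "Rank\tGene Name\tP-Value\tQ-Value\n",
    "1\t'VDAC1P'\t7.963686e-07\t0.0002886315\n",
    "2\t'RAP2A'\t7.963686e-07\t0.0002886315\n",
]
records = parse_rank_lines(example)
```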

✓ Result file

A file generated by TAGS, containing the analysis result, i.e. significant gene sets and other relevant information. The format is the same as in the text browser in the result dialog (see Usage).

✓ Gene set file

Each gene set file represents a candidate set for analysis. Each line is a gene or probe set name, which should be the same identifier as that in the expression or rank file. Repeated rows are NOT allowed. See the following example:

PBEF1

NT5C2

…
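Because repeated rows are not allowed, it can help to validate a gene set directory before pointing TAGS at it. The following is a minimal sketch of such a check, not a TAGS feature. The function names and the file name `example_set` are hypothetical, and treating every regular file in the directory as a gene set is an assumption.

```python
import os
import tempfile


def read_gene_set(path):
    """Read one gene set file: one identifier per line, duplicates rejected."""
    genes = []
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            name = line.strip()
            if not name:
                continue  # ignore empty lines
            if name in seen:
                raise ValueError(f"{path}: repeated identifier {name!r}")
            seen.add(name)
            genes.append(name)
    return genes


def read_gene_set_dir(directory):
    """Map each file name in the directory to its list of gene identifiers."""
    sets = {}
    for file_name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, file_name)
        if os.path.isfile(full_path):
            sets[file_name] = read_gene_set(full_path)
    return sets


# Demonstration with a temporary directory and a hypothetical set name.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example_set"), "w", encoding="utf-8") as out:
        out.write("PBEF1\nNT5C2\n")
    gene_sets = read_gene_set_dir(tmp)
```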

✓ External file

A tab delimited file containing the expression time series of regulators (e.g., transcription factors, TFs). The file format is the same as that of the expression file (see Expression file). The regulator identifiers in the first column should be the same as the gene set file names.

● Directory structure

There are five directories in the installation folder.

✓ lead: used to store the leading-edge subset of each significant gene set for further analysis. The file name is the order of the corresponding set in the result dialog. The file format is the same as that of gene set files.

✓ permutedCovariate: used to store the permuted covariate files when analyzing from an expression profile.

✓ permutedRank: used to store the ranks generated from the files in the permutedCovariate folder.

✓ ranks: used to store the ranks corresponding to each candidate regulator when analyzing regulators with their targets.

✓ R-2.6.2: R version 2.6.2, which is used by TAGS.

The installation folder also contains other files, including the executable TS.exe, which users can double click to run the software directly.