Manual for Multi-Node TOM

Manual for the MTOM software

Ai Li (1), Steve Horvath (1,2)

(1) Dept. of Biostatistics, School of Public Health, UCLA

(2) Dept. of Human Genetics, David Geffen School of Medicine, UCLA

Correspondence: or

The MTOM software uses the multi-node topological overlap measure (MTOM) for gene neighborhood analysis and for module detection.

To cite the MTOM software, please use the following references:

· Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581

· Li A, Horvath S (2007) Network Module Detection: Affinity Search Technique with the Topological Overlap Measure. Submitted.

Here we provide a brief user manual.

0. The Windows software can be downloaded from the following webpage:

http://www.genetics.ucla.edu/labs/horvath/MTOM/

1. After double clicking the MTOM icon, you should see the following screen that allows you to input the data.

Fig. 1

Note that the MTOM software has 3 tabs:

a) Data Input

b) Finding Gene Neighbors

c) Module Detection

In the following, we explain these tabs.

2. Data Input

The software allows one to input gene expression data and to compute a gene co-expression network. Alternatively, one can input a co-expression similarity measure (assumed to take values in the unit interval). This similarity measure can be transformed into an adjacency measure by raising it to a power (soft thresholding, which results in a weighted network adjacency) or by dichotomizing it (hard thresholding, which results in an unweighted network).

Alternatively, one can input an adjacency matrix directly. Input format: square matrix where all entries take on values in the unit interval.

[1] Input Microarray Gene Expression Data

In most applications, the user will input gene expression data. In the following, we describe how to input gene expression data and how to compute a corresponding gene co-expression network. Later, we describe how to carry out network neighborhood analysis and module detection.

The file format looks as follows

Fig. 2

The file should be a comma-delimited Excel “.csv” file. The first row can contain the column names, and the first column can contain the gene names (or probeset IDs). Rows correspond to genes; columns correspond to microarray samples. Missing data should be represented by a blank or by “NA” (which stands for not available).
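
To check a file against this layout before loading it into MTOM, it can be parsed outside the software. The following is a minimal Python sketch, not part of the MTOM software; the function name and the tiny in-memory example are hypothetical, and it assumes that the first row holds the column names and the first column holds the gene names.

```python
import csv
import io

def read_expression_csv(handle):
    """Parse a comma-delimited expression table (header row, gene names in column 1)."""
    reader = csv.reader(handle)
    samples = next(reader)[1:]            # first row: column (sample) names
    genes, values = [], []
    for row in reader:
        if not row:
            continue                      # skip completely empty lines
        genes.append(row[0])              # first column: gene name or probeset ID
        # Blank cells and "NA" are treated as missing values (None).
        values.append([None if cell.strip() in ("", "NA") else float(cell)
                       for cell in row[1:]])
    return samples, genes, values

# Hypothetical two-gene, three-sample file held in memory.
example = io.StringIO("gene,s1,s2,s3\nG1,1.5,NA,2.0\nG2,0.3,0.4,\n")
samples, genes, values = read_expression_csv(example)
print(len(genes), "genes x", len(samples), "samples")
```

The printed row and column counts can be compared with the summary that the software displays after the file is loaded.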

To input a gene expression data file, select “Expression” and click “Input”.

Then you will see the following input dialog:

Fig. 3

In the file-open dialog, select the file you want to input and click “Open”.

You may need to wait a few seconds if your file has more than 10,000 genes. A summary of the input data file will then appear:

Fig. 4

Click “OK” to proceed. If the number of rows or columns does not look right, check the input data. Remember that the file must be a comma-delimited csv file!

Since neighborhood analysis can be computationally intensive, it is often advisable to reduce the number of genes that are considered for the network construction. To do this, we provide multiple filters for reducing the number of genes.

Click “OK” on the dialog that just popped up.

Fig. 5

The software implements four filtering options:

a) coefficient of variation, which is defined as the standard deviation divided by the mean, i.e., CV=sqrt(variance)/mean.

b) variance of the expression

c) mean expression

d) whole network connectivity
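
The first three filtering statistics are simple per-gene summaries. The sketch below shows how they could be computed for one gene in Python. It is an illustration only: it uses the sample variance (n - 1 denominator), which is an assumption, since the manual does not state which variance estimator the software uses.

```python
from statistics import mean, variance

def gene_summaries(row):
    """Return (mean, variance, coefficient of variation) for one gene."""
    m = mean(row)
    v = variance(row)                      # sample variance; an assumption
    cv = (v ** 0.5) / m if m != 0 else float("nan")   # CV = sqrt(variance) / mean
    return m, v, cv

m, v, cv = gene_summaries([2.0, 4.0, 6.0])
print(m, v, cv)
```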

Usually, we filter genes based on the variance and the connectivity. The order in the panel below matters. For example, when choosing 10,000 for the variance and 4,000 for the connectivity, the program will first select the 10,000 most varying genes, next compute the connectivity of each of these genes, and then restrict the analysis to the 4,000 most connected genes (among the 10,000 most varying genes). Since modules are composed of highly connected genes, one does not lose much information when restricting the analysis to the most connected genes.
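
The two-stage logic of this example (variance first, connectivity second) can be sketched in Python as follows. This is not the MTOM implementation. The connectivity is taken here to be the sum of a gene's adjacencies to the other retained genes, with the adjacency defined as the absolute correlation raised to a power; both the definition and the power of 6 are assumptions made for illustration, and the function names and toy data are hypothetical.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def filter_genes(expr, n_var, n_conn, power=6):
    """expr maps gene name -> expression values; returns the retained gene names."""
    # Stage 1: keep the n_var most varying genes.
    by_var = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)[:n_var]
    # Stage 2: connectivity is computed only among the genes kept in stage 1.
    conn = {g: sum(abs(pearson(expr[g], expr[h])) ** power
                   for h in by_var if h != g)
            for g in by_var}
    return sorted(by_var, key=conn.get, reverse=True)[:n_conn]

expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, 3.0, 2.0, 5.0],
    "D": [5.0, 5.1, 5.0, 5.1],   # nearly constant, removed by the variance filter
}
kept = filter_genes(expr, n_var=3, n_conn=2)
print(kept)
```

Because the connectivity is computed after the variance filter, swapping the order of the two filters would generally retain a different set of genes.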

Fig. 6

Click “Filter” after choosing the filtering criteria. The computation of the connectivity may take a few minutes if you have more than 10,000 genes. After the filtering is done, you will be asked whether you want to save the filtered data set into a separate file. Regardless of whether or not you save the filtered data, the software will compute a co-expression network based on these filtered genes.

Correlation matrix: The next step is to compute a correlation matrix, which is a measure of similarity between the gene expression profiles. We have implemented an option to compute a leave-one-out correlation matrix, which is the average correlation after leaving out microarray samples one at a time. This may take a few minutes if you have more than 10,000 filtered genes.
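
The leave-one-out correlation described above can be illustrated for a single pair of genes. The Python sketch below assumes Pearson correlation and complete data (no missing values); it is a reading of the description in this manual, not the software's own code, and the example profiles are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def leave_one_out_correlation(x, y):
    """Average the correlation over all ways of leaving out one sample."""
    n = len(x)
    # Drop sample i from both profiles, correlate the rest, then average.
    return sum(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
               for i in range(n)) / n

gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 1.0, 4.0, 3.0, 20.0]   # the last sample is an outlier
full = pearson(gene_a, gene_b)
loo = leave_one_out_correlation(gene_a, gene_b)
print(full, loo)
```

Each gene pair requires one correlation per sample, which is consistent with the longer running time mentioned above.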

Adjacency matrix: After the correlation matrix is computed, you can choose a thresholding approach. Soft thresholding with the power adjacency function will result in a weighted gene co-expression network (Zhang and Horvath 2005). Hard thresholding will result in an unweighted network. The standard approach considers the absolute value of the correlation matrix, i.e., the co-expression information ignores the sign of the correlation. However, we have also implemented an option to keep track of the sign of the correlation, which results in a signed network, i.e., the neighbors will have positive correlations with the seed genes.

Fig. 7

This computation may take several minutes if you input a data set with more than 10,000 genes.
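
The thresholding options can be sketched on a small correlation matrix. In the Python illustration below, the power (2) and the hard threshold (0.5) are arbitrary example values. The signed variant, which sets negative correlations to zero before thresholding, is an assumption; the manual only states that neighbors in a signed network have positive correlations with the seed genes.

```python
def similarity_from_correlation(cor, signed=False):
    """Turn correlations into similarities in the unit interval."""
    if signed:
        # Assumption: negative correlations carry no similarity.
        return [[max(c, 0.0) for c in row] for row in cor]
    return [[abs(c) for c in row] for row in cor]   # sign is ignored

def soft_threshold(sim, power):
    """Power adjacency function: weighted network."""
    return [[s ** power for s in row] for row in sim]

def hard_threshold(sim, tau):
    """Dichotomize at tau: unweighted network with 0/1 adjacencies."""
    return [[1 if s >= tau else 0 for s in row] for row in sim]

cor = [[1.0, -0.8],
       [-0.8, 1.0]]
unsigned = similarity_from_correlation(cor)
weighted = soft_threshold(unsigned, 2)
unweighted = hard_threshold(unsigned, 0.5)
signed = similarity_from_correlation(cor, signed=True)
print(weighted, unweighted, signed)
```

In this example the two genes are connected in the unsigned networks but not in the signed one, because their correlation is negative.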

[2] Input a similarity matrix data file

A similarity matrix specifies how similar two nodes are. The program allows the user to turn this similarity into a network by soft or hard thresholding. For example, the similarity measure could be the absolute value of a correlation matrix. After inputting it, you need to choose a thresholding approach to turn it into an adjacency matrix.

The file format is the same as that of the expression file described in Fig. 2. One just needs to make sure that it is a symmetric square matrix without missing entries.

After inputting a similarity matrix, one needs to calculate the adjacency matrix as shown in Fig. 7.

[3] Input an adjacency matrix data file

The input format is the same as for a similarity matrix and for the expression file described in Fig. 2. Just make sure that it is a symmetric square matrix without missing entries.
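
The requirements on a similarity or adjacency matrix (square, symmetric, no missing entries, values in the unit interval) can be checked before the file is loaded. The following Python function is a hypothetical helper written for this purpose, not a feature of the MTOM software; missing entries are assumed to be represented by None.

```python
def matrix_problems(m, tol=1e-9):
    """Return a list of reasons why m is not a valid input matrix."""
    n = len(m)
    if any(len(row) != n for row in m):
        return ["not square"]
    if any(v is None for row in m for v in row):
        return ["missing entries"]
    problems = []
    if any(not 0.0 <= v <= 1.0 for row in m for v in row):
        problems.append("entries outside [0, 1]")
    # Compare each entry below the diagonal with its mirror image.
    if any(abs(m[i][j] - m[j][i]) > tol for i in range(n) for j in range(i)):
        problems.append("not symmetric")
    return problems

print(matrix_problems([[1.0, 0.3], [0.3, 1.0]]))   # valid matrix
print(matrix_problems([[1.0, 0.3], [0.4, 1.0]]))   # asymmetric matrix
```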

[4] Input an interaction data file

The software also allows the user to input the network in two-column format. For example, Fig. 8 presents an example of a protein-protein interaction network file. This assumes an unweighted network, i.e., adjacencies are either 1 or 0.

Fig. 8
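
To see how a two-column interaction list corresponds to an unweighted adjacency matrix, consider the following Python sketch. It assumes a comma-delimited file without a header row in which each row names one interacting pair; these details are assumptions, and the protein names are hypothetical.

```python
import csv
import io

def adjacency_from_pairs(handle):
    """Build a symmetric 0/1 adjacency matrix from a two-column pair list."""
    pairs = [tuple(cell.strip() for cell in row[:2])
             for row in csv.reader(handle) if len(row) >= 2]
    nodes = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    for a, b in pairs:
        if a != b:                         # ignore self-interactions
            adj[index[a]][index[b]] = 1
            adj[index[b]][index[a]] = 1    # interactions are undirected
    return nodes, adj

nodes, adj = adjacency_from_pairs(io.StringIO("P1,P2\nP2,P3\n"))
print(nodes)
print(adj)
```

Pairs that are not listed receive an adjacency of 0.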

3. Neighborhood Analysis

After inputting the data, we are ready for neighborhood analysis.

Click the tab “Finding Gene Neighbors”.

Fig. 9

After you click “Search”, the dialog box shown in Fig. 10 will appear if multiple seeds were input. Click “OK” to proceed.

Fig. 10

This output allows one to determine whether the seed genes have a reasonably high correlation with each other. It does not make sense to input a pair of seed genes if they don’t have a reasonably high correlation (say larger than 0.5, but this depends to some extent on the number of microarrays).

4. Module Detection

This tab implements the Module Affinity Search Technique (MAST).

The procedure forms modules around a set of hub neighborhoods which can either be input directly or automatically determined by the program.

The procedure is carried out in 3 steps, which are described in our article and in the figure below.

The details of the figure are not essential here.

The main message is that you can either import a list of initial hub seeds or let the program find initial seeds automatically (step 1).

In step 2, the hub seeds and neighborhoods are grown into preliminary modules.

In step 3, the preliminary modules may be merged if their relative similarity passes a threshold.

The following tab describes module detection.

Regarding step 1:

The following file is not necessary when the initial hubs are chosen automatically. But if you want to input your own initial seeds or hub neighborhoods directly, use the following format: a comma-delimited csv file.

The second column labels the initial seeds or hub neighborhoods.

In other words, if you input 10 distinct hubs, then you end up with 10 rows whose labels run from 1 to 10. However, if you have 2 seeds for one initial neighborhood, then you would have 2 corresponding rows.

Potential application: This kind of input allows you to first use hierarchical clustering in R to find modules, and then to use the hub nodes in those modules as seeds for MTOM.
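
A seed file of the kind described above could be generated with a short script. The Python sketch below assumes that the first column holds the gene name, that the second column holds the label, that there is no header row, and that seeds belonging to the same initial neighborhood share one label. These layout details are assumptions that should be checked against the example file; the gene names are hypothetical.

```python
import csv
import io

def write_seed_file(handle, neighborhoods):
    """Write one row per seed gene: gene name, neighborhood label (1, 2, ...)."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, seeds in enumerate(neighborhoods, start=1):
        for gene in seeds:
            writer.writerow([gene, label])

# Hypothetical input: one single-hub neighborhood and one with two seeds.
buffer = io.StringIO()
write_seed_file(buffer, [["G7"], ["G2", "G9"]])
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the in-memory buffer.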

The output file assigns to each gene a cluster label.

Importantly, 0 is reserved for unassigned nodes.

Sometimes we use the color grey in network analysis to denote these unclustered genes.
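
The cluster labels can be summarized outside the software, for example to count module sizes and to list the unassigned (grey) genes. The Python sketch below assumes a two-column, comma-delimited output (gene name, cluster label) without a header row; this layout is an assumption and should be checked against the actual output file.

```python
import csv
import io
from collections import Counter

def read_module_labels(handle):
    """Map each gene name to its integer cluster label."""
    return {row[0]: int(row[1]) for row in csv.reader(handle) if len(row) >= 2}

# Hypothetical output held in memory; label 0 marks unassigned genes.
labels = read_module_labels(io.StringIO("G1,1\nG2,0\nG3,1\nG4,2\n"))
grey = sorted(gene for gene, label in labels.items() if label == 0)
sizes = Counter(label for label in labels.values() if label != 0)
print("unassigned (grey):", grey)
print("module sizes:", dict(sizes))
```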

Output file:

If you want to learn more about the module detection, please contact us by email.

THE END