Natural Communities from the Full Tree:
The directory of all clusters is jmy6/citeseer/src/100treetest/allcomms1. (This is on sundial and you should have access to the files; if not, tell me and I'll change it.)
allcomms1 is the base directory. The other 99 trees are in the directories allcomms2 through allcomms100, but you don't have to worry about those files.

The files in this directory are named by cluster name (i.e., every name in the list of natural communities corresponds to a file in this directory). Each file contains a list of the papers included in that cluster.
The corresponding core clusters of the natural communities are located in jmy6/citeseer/src/100treetest/base1core

They have the same names and format as the regular clusters.
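If you want to read a cluster file from Python, here is a minimal sketch. It assumes each file is just whitespace-separated paper numbers (for example one per line); that layout is a guess and is not confirmed anywhere in these notes, and read_cluster is a made-up helper name.

```python
from pathlib import Path


def read_cluster(path):
    """Return the paper numbers listed in one cluster file.

    Assumption: the file holds whitespace-separated integers
    (e.g. one paper number per line). Adjust if the real files differ.
    """
    text = Path(path).read_text()
    return [int(token) for token in text.split()]
```

The same function should work for the core clusters, since they are described as having the same format.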

Natural Communities from the Before Tree:

The directory of all clusters is located in jmy6/citeseer/src/100treetest/allcommsbefore991

The files in this directory are named by cluster name (i.e., every name in the list of natural communities corresponds to a file in this directory). Each file contains a list of the papers included in that cluster.
The corresponding core clusters of the natural communities are located in jmy6/citeseer/src/100treetest/base1core70before

They have the same names and format as the regular clusters.

Other Useful Files:
All the files below are located in jmy6/citeseer/src/100treetest/

naturalcommunities100runsbase1-5compressed7 – a list of all natural communities from the full tree

naturalcommunitiesbefore1-5compressed7 – a list of all natural communities from the before tree

dcorebefore – when loaded in MATLAB, this is a list of all pairs of matching communities from the before and after trees, each pair followed by its match

foundinafterbutnotbefore – all natural communities found in the after tree but not in the before tree

foundinbeforebutnotafter – all natural communities found in the before tree but not in the after tree

foundinafterbutnotbeforenothreshold – same as foundinafterbutnotbefore, but excludes threshold communities

foundinbeforebutnotafternothreshold – same as foundinbeforebutnotafter, but excludes threshold communities
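Since each nothreshold file is the corresponding list with the threshold communities left out, comparing the two should recover the threshold communities themselves. The sketch below assumes these files are plain text with one community name per line, which is not stated above; read_names and threshold_only are hypothetical helper names.

```python
def read_names(path):
    """Read one community name per line, skipping blank lines.

    Assumption: the list files are plain text, one name per line.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def threshold_only(full_list, nothreshold_list):
    """Return names in full_list that are missing from nothreshold_list.

    Order follows full_list, so the result is easy to compare by eye.
    """
    kept = set(nothreshold_list)
    return [name for name in full_list if name not in kept]
```

For example, threshold_only(read_names("foundinafterbutnotbefore"), read_names("foundinafterbutnotbeforenothreshold")) would list the threshold communities found only in the after tree, if the format assumption holds.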

Useful Operations:

Note: I haven't described how to find the strength of a community, since it is somewhat complicated and I think I have given you all the strengths you will need.

Given these files, there are several useful things that you can do.

The program int-title takes a paper number (i.e., one that you find by looking inside a community) and prints the paper's title. Type int-title at the prompt and wait for the output "ready to scan"; after that, each number you type prints the corresponding title. Type "-2" to end the program.
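To look up several papers without typing them one at a time, int-title can be fed its input from Python. This is only a sketch: it assumes int-title is on your PATH and reads one number per line from standard input (the word-frequency script in these notes redirects a cluster file into it the same way). The function names are made up, and the returned lines are raw output, so they may include the "ready to scan" message.

```python
import subprocess


def build_session_input(paper_numbers):
    """Build the text to type into int-title: one number per line, then -2."""
    lines = [str(int(number)) for number in paper_numbers]
    lines.append("-2")  # -2 tells int-title to exit
    return "\n".join(lines) + "\n"


def lookup_titles(paper_numbers, command=("int-title",)):
    """Run int-title once and return its output lines, unfiltered."""
    result = subprocess.run(
        list(command),
        input=build_session_input(paper_numbers),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```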

Another useful operation is finding the frequency of words in a community. This can be done using the following script (a copy is in jmy6/citeseer/src/, called "naturalwordfreq"):
#!/bin/bash
# Run from jmy6/citeseer/src/ so that the relative paths below resolve.

# Make sure the output file starts out empty.
: > temp1

# The path here is the list of clusters that you want to examine.
for i in $(cat 100treetest/naturalcommunitiesbefore1-5compressed7)
do
    # Run int-title on the cluster file and write the titles to a temp file.
    int-title < "100treetest/base1core70before/$i" > tester

    # Count word frequencies: tester is the temp file of titles and 7 is the
    # number of keywords to track (you can change this to anything).
    # Append (>>) to temp1 so that results from earlier clusters are kept.
    python title-freq.py tester 7 >> temp1
done

Here 100treetest/naturalcommunitiesbefore1-5compressed7 contains the list of clusters we want to examine, and 100treetest/base1core70before/ is the directory where their files are located.
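For a rough idea of the counting step, here is a small Python sketch that tallies the most frequent words in a list of titles. It is not the title-freq.py script itself, whose internals are not described in these notes: top_words is a hypothetical name, and the tokenization (lowercased runs of letters, no stop-word filtering) is an assumption.

```python
import re
from collections import Counter


def top_words(titles, k=7):
    """Return the k most common words across titles as (word, count) pairs.

    Words are lowercased runs of ASCII letters; nothing is filtered out,
    so common words like "of" or "the" will show up in real data.
    """
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z]+", title.lower()))
    return counts.most_common(k)
```

The default k=7 mirrors the 7 keywords used in the script; dropping very common words would probably make the output more useful.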
Another useful operation is finding the size of a community. This is done in MATLAB with a file called sizeofcomm.m, located in the directory jmy6/citeseer/src/100treetest/.
To run it, type sizeofcomm(directory of clusters, list of names of clusters) at the MATLAB prompt.
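If MATLAB is not handy, community sizes can be approximated by counting the entries in each cluster file. This sketch is not a port of sizeofcomm.m; it assumes a community's size is the number of whitespace-separated paper numbers in its file, and community_sizes is a made-up name.

```python
import os


def community_sizes(cluster_dir, names):
    """Map each cluster name to the number of papers listed in its file.

    Assumption: one whitespace-separated paper number per entry.
    """
    sizes = {}
    for name in names:
        with open(os.path.join(cluster_dir, name)) as handle:
            sizes[name] = len(handle.read().split())
    return sizes
```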
The last useful operation is finding the year distribution of the papers in a cluster. It can be run using the following script (located in jmy6/citeseer/src/, called "yearsdistscript"):

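#!/bin/bash
# Run from jmy6/citeseer/src/ so that the relative paths below resolve.

# Make sure the output file starts out empty.
: > test1

# The path here is the list of all communities you want to examine.
for i in $(cat 100treetest/naturalcommunities100runsbase1-5compressed7)
do
    # yearsdist is the program, 100treetest/base1core/ is the directory that
    # holds the cluster files, and test1 is the output file.
    # Append (>>) to test1 so that results from earlier clusters are kept.
    yearsdist "100treetest/base1core/$i" >> test1
done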
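Both scripts follow the same pattern: loop over a list of community names, run a program on each cluster file, and collect the output in one file. A Python sketch of that pattern is below. It assumes the program takes the cluster path as its last argument, the way yearsdist is called above, and run_per_cluster is a hypothetical helper.

```python
import os
import subprocess


def run_per_cluster(command, cluster_dir, names, output_path):
    """Run command on each cluster file and collect all output in one file.

    command is a list such as ["yearsdist"]; the cluster path is appended
    as the final argument. The output file is truncated once, then each
    run's standard output is written after the previous one.
    """
    with open(output_path, "w") as out:
        for name in names:
            cluster_path = os.path.join(cluster_dir, name)
            result = subprocess.run(
                list(command) + [cluster_path],
                capture_output=True,
                text=True,
                check=True,
            )
            out.write(result.stdout)
```

With check=True the loop stops at the first cluster the program fails on, which makes a bad name in the list easy to spot.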
I can't think of anything else you would want, but email me if you have any questions. I'll check quite often.