Natural Communities from the Full Tree:
The directory of all clusters is jmy6/citeseer/src/100treetest/allcomms1. (This is on sundial and you should have access to the files; if not, tell me and I’ll change it.)
allcomms1 is the base directory. The other 99 trees are in the directories allcomms2 through allcomms100, but you don’t have to worry about those files.
The files in this directory are named by cluster name (i.e., every name in the list of natural communities corresponds to a file in this directory). Each file contains a list of the papers included in that cluster.
The corresponding core clusters of the natural communities are located in jmy6/citeseer/src/100treetest/base1core
They have the same name and format as regular clusters.
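Because a core cluster has the same name and format as its natural community, the two files can be compared directly. Here is a minimal Python sketch, assuming each file holds whitespace-separated integer paper numbers (the exact file layout is not spelled out in these notes). The helper names read_papers and core_overlap are made up for illustration.

```python
from pathlib import Path


def read_papers(path):
    """Return the set of paper numbers in one cluster file.

    Assumption: the file is plain text with whitespace-separated integers.
    """
    return {int(token) for token in Path(path).read_text().split()}


def core_overlap(comm_dir, core_dir, name):
    """Split a community's papers into those in its core and those outside it."""
    papers = read_papers(Path(comm_dir) / name)
    core = read_papers(Path(core_dir) / name)
    return papers & core, papers - core
```

Run from jmy6/citeseer/src/, a call would look like core_overlap("100treetest/allcomms1", "100treetest/base1core", name), where name is any entry from the list of natural communities.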
Natural Communities from the Before Tree
The directory of all clusters is located in jmy6/citeseer/src/100treetest/allcommsbefore991
The files in this directory are named by cluster name (i.e., every name in the list of natural communities corresponds to a file in this directory). Each file contains a list of the papers included in that cluster.
The corresponding core clusters of the natural communities are located in jmy6/citeseer/src/100treetest/base1core70before
They have the same name and format as regular clusters.
Other Useful Files:
All the files below are located in jmy6/citeseer/src/100treetest/
naturalcommunities100runsbase1-5compressed7 – a list of all natural communities from the full tree
naturalcommunitiesbefore1-5compressed7 – a list of all natural communities from the before tree
dcorebefore – when loaded in MATLAB, this is a list of all pairs of matching communities from the before and after trees, each pair followed by its match
foundinafterbutnotbefore – all natural communities found in the after tree but not in the before tree
foundinbeforebutnotafter – all natural communities found in the before tree but not in the after tree
foundinafterbutnotbeforenothreshold – same as foundinafterbutnotbefore, but excluding threshold communities
foundinbeforebutnotafternothreshold – same as foundinbeforebutnotafter, but excluding threshold communities
Useful Operations
Note: I haven’t described how to find the strength of a community, since it is somewhat complicated and I think I have given you all the strengths you would need.
Given these files, there are several useful things you can do.
The program int-title takes a paper number (i.e., one that you find by looking inside a community file) and prints the paper’s title. Type int-title at the prompt and wait for the output “ready to scan”; after that, type paper numbers and it will print the title for each one. Type “-2” to end the program.
Another useful operation is finding the frequency of words in a community. This can be done using the following script (a copy is in jmy6/citeseer/src/, called “naturalwordfreq”):
#!/bin/bash
# Start with an empty output file.
: > temp1
# The file read here is the list of clusters that you want to examine.
for i in $(cat 100treetest/naturalcommunitiesbefore1-5compressed7)
do
    # Run int-title on the cluster file and write the titles to a temp file.
    int-title < "100treetest/base1core70before/$i" > tester
    # Count word frequencies: tester is the temp file of titles, 7 is the number
    # of keywords to track (you can change this), and temp1 is the output file.
    # Use >> so each community's result is appended rather than overwritten.
    python title-freq.py tester 7 >> temp1
done
In other words, 100treetest/naturalcommunitiesbefore1-5compressed7 contains the list of clusters to examine, and 100treetest/base1core70before/ is the directory where their files are located. Both paths are relative to jmy6/citeseer/src/.
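The source of title-freq.py is not reproduced in these notes, so the following is only a sketch of the kind of count it performs, not the script itself. Given a collection of titles, it reports the k most frequent words. The function name top_keywords and the small stopword list are assumptions.

```python
import re
from collections import Counter

# Assumed stopword list; adjust to taste.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def top_keywords(titles, k=7):
    """Return the k most common non-stopword words across the given titles."""
    counts = Counter()
    for title in titles:
        # Lowercase and keep alphabetic runs only, so punctuation is ignored.
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)
```

If tester holds one title per line (an assumption about int-title's output), passing the open file object as titles would give the top keywords for that community.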
Another useful operation is finding the size of a community. This is done in MATLAB with a file called sizeofcomm.m, located in the directory jmy6/citeseer/src/100treetest/
Run it by typing sizeofcomm(directory of clusters, list of names of clusters)
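For anyone without MATLAB at hand, the same count can be sketched in Python. This is not sizeofcomm.m. It assumes that a cluster's size is the number of whitespace-separated paper numbers in its file, and that the list of names is a text file of whitespace-separated cluster names. community_sizes is a hypothetical name.

```python
from pathlib import Path


def community_sizes(cluster_dir, names_file):
    """Map each cluster name in names_file to the number of papers in its file."""
    names = Path(names_file).read_text().split()
    return {
        name: len((Path(cluster_dir) / name).read_text().split())
        for name in names
    }
```

Run from jmy6/citeseer/src/, a call such as community_sizes("100treetest/allcomms1", "100treetest/naturalcommunities100runsbase1-5compressed7") would return one size per natural community from the full tree.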
The last useful operation is finding the year distribution of the papers in a cluster. It can be run using the following script (located in jmy6/citeseer/src/, called “yearsdistscript”):
#!/bin/bash
# Start with an empty output file.
: > test1
# The file read here is the list of all communities you want to examine.
for i in $(cat 100treetest/naturalcommunities100runsbase1-5compressed7)
do
    # yearsdist is the program, 100treetest/base1core/ is the directory holding
    # the cluster files, and test1 is the output file. Use >> so each
    # community's distribution is appended rather than overwritten.
    yearsdist "100treetest/base1core/$i" >> test1
done
I can’t think of anything else you would want, but email me if you have any questions; I’ll check quite often.