Semantic Web Project Report
Victor E. Gallego
Table of Contents
Introduction
The GUI
Web Crawling and Indexing
Searching
Conclusion
Introduction
This project takes the perspective of the accreditation chairman, whose role is to aggregate all of the Computer Engineering professors’ publications from a certain time period. The tool created here searches the web and retrieves the links to those publications.
The GUI
The graphical user interface consists of the following components: a search field, where the user enters the query string; a search button, which submits the query string to the index searcher; an index button, which runs the web crawler; a combo-box, which selects the number of hits to display in the results; and a text area, where all the results are displayed.
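As a rough illustration only, the following Swing sketch shows one way these components could be laid out. The class and field names (SearchFrame, resultsArea, and so on) are assumptions made for this sketch and are not taken from the project source.
import java.awt.BorderLayout;
import javax.swing.*;

public class SearchFrame extends JFrame {

    // Component names below are assumptions used only for this sketch
    private final JTextField searchField = new JTextField(30);      // query string
    private final JButton searchButton = new JButton("Search");     // submit the query to the searcher
    private final JButton indexButton = new JButton("Index");       // run the web crawler
    private final JComboBox hitsCombo =
            new JComboBox(new String[] { "10", "20", "50" });        // number of hits to display
    private final JTextArea resultsArea = new JTextArea(20, 60);    // result listing

    public SearchFrame() {
        super("Publication Search");
        JPanel controls = new JPanel();
        controls.add(searchField);
        controls.add(searchButton);
        controls.add(indexButton);
        controls.add(hitsCombo);
        add(controls, BorderLayout.NORTH);
        add(new JScrollPane(resultsArea), BorderLayout.CENTER);
        // Button listeners would call buildIndex() and runQuery(), shown later in this report.
        setDefaultCloseOperation(EXIT_ON_CLOSE);
        pack();
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                new SearchFrame().setVisible(true);
            }
        });
    }
}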
Web Crawling and Indexing
The web crawling feature is provided by WebSphinx, a platform-agnostic Java library developed at Carnegie Mellon. The library works well with Lucene for indexing, and its API documentation is easy to follow. The web crawling is performed by the following code:
IndexingCrawler.java
import websphinx.*;
import java.io.*;
import java.net.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingCrawler extends Crawler {

    private IndexWriter writer;

    public IndexingCrawler(IndexWriter writer, String docroot) {
        super();
        try {
            this.setRoot(new Link(docroot));
        } catch (MalformedURLException e) {
            this.setRoot(null);
        }
        this.writer = writer;
        this.setSynchronous(true);
        // Restrict the crawl to the seed page's server
        this.setDomain(Crawler.SERVER);
        // Set the maximum crawl depth
        this.setMaxDepth(20);
    }

    public void visit(Page p) {
        System.out.println("Visiting [" + p.getURL() + "]");
        if (p.getTitle() == null) {
            noindex(p); // skip non-HTML content such as PDF files
        } else {
            index(p);   // process the page text
        }
        System.out.println(" Done.");
    }

    public void index(Page p) {
        StringBuffer contents = new StringBuffer();
        Document doc = new Document();
        doc.add(new Field("path", p.getURL().toString(), Field.Store.YES, Field.Index.ANALYZED));
        if (p.getTitle() != null) {
            doc.add(new Field("title", p.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
        }
        System.out.println(" Indexing...");
        System.out.println("  depth    [" + p.getDepth() + "]");
        System.out.println("  title    [" + p.getTitle() + "]");
        System.out.println("  modified [" + p.getLastModified() + "]");
        // Index any named meta tags as separate fields
        Element[] elements = p.getElements();
        for (int i = 0; i < elements.length; i++) {
            if (elements[i].getTagName().equalsIgnoreCase("meta")) {
                String name = elements[i].getHTMLAttribute("name", "");
                String content = elements[i].getHTMLAttribute("content", "");
                if (!name.equals("")) {
                    doc.add(new Field(name, content, Field.Store.YES, Field.Index.ANALYZED));
                    System.out.println("  meta [" + name + ":" + content + "]");
                }
            }
        }
        // Concatenate the page's words into the "contents" field
        Text[] texts = p.getWords();
        for (int i = 0; i < texts.length; i++) {
            contents.append(texts[i].toText());
            contents.append(" ");
        }
        doc.add(new Field("contents", contents.toString(), Field.Store.YES, Field.Index.ANALYZED));
        try {
            writer.addDocument(doc);
        } catch (IOException e) {
            throw new RuntimeException(e.toString());
        }
    }

    public void noindex(Page p) {
        System.out.println(" Skipping...");
    }
}
The web crawl is started from the GUI using the following code:
private void buildIndex() {
    try {
        // setIndex() opens the index directory on disk; the "true" argument
        // to IndexWriter below will overwrite any existing index there.
        setIndex();
        lblIndexingDone.setText("Now Indexing...");
        IndexWriter writer = new IndexWriter(index, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        // The mergeFactor value tells Lucene how many segments of equal size
        // to build before merging them into a single segment
        writer.setMergeFactor(20);
        // Set up a new IndexingCrawler instance rooted at the seed URL
        IndexingCrawler c = new IndexingCrawler(writer, "
        c.run();
        // Merge the index segments for faster searching
        writer.optimize();
        // Close the writer when done
        writer.close();
    } catch (MalformedURLException e) {
        e.printStackTrace(System.out);
    } catch (IOException e) {
        e.printStackTrace(System.out);
    }
}
Notice that the maximum crawl depth is set to 20 for this crawler. The resulting index is stored on the file system as a directory of Lucene index files.
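The setIndex() helper called in buildIndex() is not shown above. The following is a minimal sketch of what it might look like using Lucene's FSDirectory; the field declarations, imports, and the relative path "index" are assumptions, chosen only to be consistent with how index and analyzer are used in buildIndex() and runQuery().
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Assumed fields on the GUI class, inferred from how they are used above
private Directory index;
private StandardAnalyzer analyzer;

private void setIndex() throws IOException {
    // Open (or create) the index directory on the file system;
    // the relative path "index" is an assumption for this sketch
    index = FSDirectory.open(new File("index"));
    analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
}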
After the Index button has been pressed, the GUI displays the indexing results.
Searching
Once the index has been built with Lucene, it is ready for queries. The following code is responsible for searching the index:
private void runQuery() {
    try {
        // The "contents" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer)
                .parse(querystr);
        hitsPerPage = Integer.parseInt((String) comboBox.getSelectedItem());
        IndexSearcher searcher = new IndexSearcher(index, true);
        TopScoreDocCollector collector = TopScoreDocCollector.create(
                hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // Display the results
        System.out.println("\nFound " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
            System.out.println("\t" + d.get("path"));
        }
        // The searcher can only be closed when there is no further
        // need to access the documents.
        searcher.close();
    } catch (Exception e) {
        // Report parse and I/O errors rather than silently swallowing them
        e.printStackTrace(System.out);
    }
}
After a search, the GUI displays the results in the text area. The following screenshot shows the results for the query “Research”.
The next screenshot shows what happens when a larger number of hits is selected in the combo-box.
As you can see, the result list becomes longer, allowing the user to view more results for the same query.
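Note that runQuery() above prints the hits to standard output. The following sketch shows how the same loop could write into the GUI's text area instead; resultsArea is an assumed field name carried over from the earlier GUI sketch, and the snippet is meant to sit inside the existing try block after the hits have been collected.
// Inside the try block of runQuery(), after the hits have been collected:
StringBuffer results = new StringBuffer();
results.append("Found " + hits.length + " hits.\n");
for (int i = 0; i < hits.length; ++i) {
    Document d = searcher.doc(hits[i].doc);
    results.append((i + 1) + ". " + d.get("title") + "\n");
    results.append("\t" + d.get("path") + "\n");
}
// resultsArea is the JTextArea described in the GUI section (assumed name)
resultsArea.setText(results.toString());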
Conclusion
This tool is very useful for a professor who has to gather publications on behalf of other professors. It is a good example of several intelligent web technologies, and it gives the user an elegant interface to work with. Finally, it brings together the different tools we learned about during the Fall 2011 semester.