Project 1: Simple Inverted Index

Objectives

The objectives for this project, in decreasing order of importance, are:

1.  Verify that you can log in, run, and monitor a Hadoop job, and that you can copy data into and out of HDFS.

2.  Become familiar with decomposing a simple problem into Map and Reduce stages.

3.  Get a sense for how MapReduce can be used in the real world.

Background

An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines use some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table that maps each word in the documents to some sort of document identifier. For example, given the following two documents:
Doc1:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

Doc2:

Buffalo are mammals.

we could construct the following inverted file index:

Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
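The mapping above can be built with an ordinary hash table before any MapReduce is involved. Below is a minimal, Hadoop-free sketch in plain Java; the document IDs and whitespace-only tokenization are simplifying assumptions (note that splitting on whitespace alone keeps "buffalo" and "buffalo." distinct, exactly as in the example):

```java
import java.util.*;

public class TinyInvertedIndex {
    // word -> set of document IDs containing it. Tokens are split on
    // whitespace only, so "buffalo" and "buffalo." remain distinct words.
    public static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc1", "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.");
        docs.put("Doc2", "Buffalo are mammals.");
        build(docs).forEach((word, ids) -> System.out.println(word + " -> " + ids));
    }
}
```

The project's point is that this same logic must be decomposed into Map and Reduce stages once the corpus no longer fits comfortably on one machine.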

Description

For this project, you will work individually to construct a simple inverted file index that maps words to document IDs. You will turn in your Mapper(s) and Reducer(s), and a short write-up of your findings.

Note that we are not specifying which corpus you should index, to give you the chance to pick one that interests you. Note also that input formats differ between corpora; it is not unreasonable to expect that your Mappers and Reducers cannot be used as-is on both Shakespeare and the web crawl.

Part 0

Load a test corpus into HDFS; this corpus can consist of any documents you like. We suggest creating at least one test corpus no larger than a few hundred words; a small corpus makes it much easier to debug your Mapper and Reducer classes if necessary. The final Mappers and Reducers you turn in are not required to be compatible with the test corpus from Part 0.

Please describe your test corpus (including HDFS path) in the write-up specified below.

Part 1

Some words are so common that their presence in an inverted index is "noise" -- they can obscure the more interesting properties of a document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. For the first part of this project, run a word count over an interesting corpus (e.g., Shakespeare or the web crawl data) to identify such words.
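One possible criterion -- and it is only one of many -- is to flag any word whose relative frequency exceeds a fixed threshold. A Hadoop-free sketch in plain Java, where the threshold value is an arbitrary assumption you would tune for your corpus:

```java
import java.util.*;

public class NoisyWords {
    // Count word occurrences, then flag any word whose relative frequency
    // exceeds the given threshold (e.g., 0.05 means more than 5% of tokens).
    public static Set<String> findNoisy(List<String> words, double threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        Set<String> noisy = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / words.size() > threshold) noisy.add(e.getKey());
        }
        return noisy;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "the", "cat", "sat", "on", "the", "mat", "the", "dog", "the", "end");
        // "the" is 4 of 10 tokens (40%), well above a 20% threshold.
        System.out.println(findNoisy(corpus, 0.2));
    }
}
```

In the real project the counting step is exactly the classic WordCount MapReduce job; the thresholding could happen in the Reducer or in a post-processing pass.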

In the required write-up, please explain how you determined a word was "noisy".

Part 2

For this portion of the assignment, you will design a MapReduce-based algorithm to calculate the inverted index over the corpus you used in Part 1. Your final inverted index should not contain the words identified in Part 1. How you remove those words is up to you; one possibility is to chain multiple MapReduce passes, but there are many other ways. Also, the format of your MapReduce output (i.e., the inverted index) must be simple enough to be machine-parseable; imagine your index as one of many data structures in a search engine's indexing pipeline. Lastly, your submitted indexer should run successfully on the corpus specified in your write-up. "Successfully" means it should run to completion without errors or exceptions, and generate the correct word->DocID mapping.
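One natural decomposition: the Mapper emits a (word, docID) pair for every token, and the Reducer collects the distinct document IDs for each word. The plain-Java simulation below illustrates that logic only -- it is not Hadoop API code (your real classes would extend Hadoop's Mapper and Reducer), and the noisy-word filtering shown here in the reduce step is just one of the possible placements:

```java
import java.util.*;

public class InvertedIndexSim {
    // Map stage: emit one (word, docId) pair per token in the document.
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : text.split("\\s+")) pairs.add(new String[]{word, docId});
        return pairs;
    }

    // Reduce stage: for each word, collect the distinct document IDs,
    // skipping any word in the noisy-word set identified in Part 1.
    static Map<String, Set<String>> reduce(List<String[]> pairs, Set<String> noisy) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String[] p : pairs) {
            if (noisy.contains(p[0])) continue;
            index.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("Doc1", "the cat sat"));
        pairs.addAll(map("Doc2", "the dog ran"));
        // "the" is filtered out as a noisy word.
        System.out.println(reduce(pairs, new HashSet<>(Arrays.asList("the"))));
    }
}
```

One design note: in real Hadoop the grouping of pairs by word happens in the shuffle between the two stages, so the Reducer sees each word with all of its docIDs at once.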

You will be required to turn in all relevant Mapper and Reducer Java files, in addition to any supporting code or utilities. Your write-up should specify what corpus your Mapper(s) and Reducer(s) are designed for.

Part 3

In addition to the requirements detailed above, you will need to implement at least one extension of your choice. This extension may be as simple or as complex as you'd like; we primarily want to give you a chance to play with the MapReduce system. We have provided some sample extensions below, but don't let our suggestions limit your creativity!

Please describe what extension(s) you chose to do in the final write-up.

Write-up

Answer the questions below and submit your answers with your project:

1.  (answer this before you begin) How long do you think this project will take you to finish?

2.  How much time did you actually spend on this project?

3.  Acknowledge any assistance you received from anyone other than the course staff and the assigned course readings.

4.  What test corpus did you load into HDFS? Where is it located (i.e., its HDFS path)?

5.  What corpus are your Mappers and Reducers designed for?

6.  What, if any, "noisy words" did you find as a result of your WordCount? By what criteria did you determine that they were "noisy"? Were you surprised by any of these words?

7.  Which extension(s) did you do? Please also detail any interesting results you found.

8.  What did you enjoy about this assignment? What did you hate? Could we have done anything better?

Submit your source files, your jar file, and the HDFS locations of your input (and output) files. Tar up a directory containing these and submit it.

Extensions

We have provided a list of extensions below; however, feel free to do extensions not in this list.

·  A naive parser will group words by attributes that are not relevant to their meaning. Modify your parser to "scrub" words. You can define "scrub" however you wish; some suggestions include case-insensitivity, punctuation-insensitivity, etc. You will get extra karma for creating a language-independent scrubbing algorithm, but this is not required.

·  Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase!) and return the IDs of the documents which contain that word.

·  Instead of creating an inverted file index (which maps words to their document ID), create a full inverted index (which maps words to their document ID + position in the document). How do you specify a word's position in the document?

·  If you created a full inverted index, write a query program on top of the index which returns not only the document IDs but also a text "snippet" from each document showing where the query term appears in the document.

·  Modify your parser to strip HTML tags (if your corpus includes HTML). E.g., "<B>Bolded text</B>" should parse as "Bolded" and "text", not "<B>Bolded" and "text</B>".

·  Run your indexer on a non-English corpus. How do the results differ for Western European languages (e.g., German, Spanish, Portuguese)? How about non-Western-European languages (e.g., Russian, Korean, Vietnamese)?

·  Any other cool extension you can think of!
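To make the first ("scrubbing") extension concrete: one possible definition of "scrub" is to lowercase a token and strip non-letter characters from both ends. The sketch below is only one choice of rules, not a required implementation; a language-independent version would match Unicode letter classes (e.g., \p{L}) instead of a-z:

```java
public class Scrubber {
    // Lowercase the token, then strip non-letter characters from both ends.
    // This is one possible definition of "scrub"; note it only trims the
    // ends, so internal punctuation (e.g., apostrophes) is preserved.
    public static String scrub(String token) {
        return token.toLowerCase().replaceAll("^[^a-z]+|[^a-z]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(scrub("Buffalo."));      // buffalo
        System.out.println(scrub("\"mammals,\""));  // mammals
    }
}
```

With this scrub applied in the Mapper, "Buffalo", "buffalo", and "buffalo." from the Background example would all collapse into a single index entry.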

Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under the Creative Commons Attribution 3.0 License -- http://creativecommons.org/licenses/by/3.0/