Final Project

Post date: Jul 20, 2009 6:19:45 PM

Administrative

Deadlines:

    • With a 10% bonus: Thursday, July 30, 2009, at 11:59pm.
    • Last deadline: Monday, August 3, 2009, at 11:59pm.
    • No late submission will be accepted.

All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.

You will be given 1.5 weeks (with a bonus) or 2 weeks (without bonus) to finish this assignment. You may pair up with another student (forming a team of two), but you must form the group and notify Ilir (cc' Likai) of the team by July 23, (Thursday).

Description

In this homework, you will implement a search engine for plain text files stored on the same computer the search engine runs. The search engine will rely on concepts such as dictionary (hash table) and sets (binary search tree) to maintain the index used for searches and to manipulate search results.

Crawl and Index Phase

Your search engine program will start by asking the user to locate the directory to be indexed. You can either use the JFileChooser dialog (as in NanameVerifierMain.java in homework 1 and NanameSolverMain.java in homework 2) or prompt the user for a directory name at the console. Once the user picks a directory, your crawler will recursively index all files under the directory. The JFileChooser dialog will give you a File object, which actually represents a path name, not an opened file.

The File object can be used to enumerate all the files in a directory as follows: you can tell whether the file object represents a regular file or a directory. If the file is a regular file, then you index it. If the file is a directory, then you can call listFiles() method to get a list of files it contains. Then for each file returned, you either index it or traverse the directory recursively.

To open a file for indexing, you will use the Scanner class. Essentially, each string you read using the next() method will be a word that you index. For each file indexed, you will maintain a histogram dictionary that maps from all the words in the file to the number of their occurrences. Convert all the words to lowercase for case-insensitive indexing. The default delimiter for word is white space, but you can add punctuation to make it more appropriate for the English language (this is one of the optional features you can implement).

After you finish reading a file and populating the histogram for word occurrence, you will update a master word index. The master word index is a dictionary that maps from a word to a set of files that contains the word. Notice that a word can happen in multiple files, so the "value" part of the "key-value" dictionary mapping must be a set.

To sum up,

    • An indexed file contains mappings from String (a word) to Integer (the number of occurrences).
    • The master index contains mappings from String (a word) to a set of indexed files containing that word.

Query Phase

After crawling and indexing is done, you will prompt the user for search terms. The search syntax, for simplicity, is borrowed from Forth (see ForthInterpreterMain.java in homework 3).

Search query syntax:

    • A string that is not recognized as any of the operators below is a search term.
    • Operator ?: conducts a search for the search term on top of the operand stack.
    • Operator OR: forms a new search term by OR'ing the top two search terms on the operand stack.
    • Operator AND: forms a new search term by AND'ing the top two search terms on the operand stack.

As such, a search term is not just a string, but it is an instruction for manipulating search results from the master index. For example, the query string:

url uri OR rfc AND ?

searches the master word index for the set of files that contain the word "url" (let's refer to this as the url-set), "uri" (referred to as the uri-set), and "rfc" (referred to as the rfc-set). It will take the union of the url-set and uri-set, and intersect the result with the rfc-set. The resulting set will contain all files that contain either the words "url" or "uri" but must contain the word "rfc".

In other words, the OR operator corresponds to the union operation on two sets of search results, and AND operator corresponds to the intersection operation.

Once you get the desired search result set, just print out the file names of all files in the set.

Grading Criteria

A search engine does not stop here. There are many ways for improvement, so this project is separated into required and optional components. The total of this homework is 200 points, but you may implement more for extra credit. In your homework submission, you must clearly indicate the options taken.

Required

    • (60 pts) Crawling and indexing.
    • (60 pts) Execution of search queries.

Optional

    • (40 pts) Implementation of all data structure. You will reuse code from previous homework assignments as well as implement the hash table for dictionary. This could be a chance for you to fix code that did not work before. If this option is not taken, you may use Java data structure API, e.g. HashMap and anything you find applicable in the java.util package.
    • (20 pts/40 pts) Relevance metric. Since you keep a list of words and their number of occurrences for each file, devise a relevance metric based on the number of occurrences, and sort the search result according to relevance. You get 20 points for this item if you use Java's sorting routine, and 40 points if you implement your own merge sort or quick sort.
    • (20 pts) Improve word delimiter for the English language. The Scanner uses white space for word delimiter by default, but in English, punctuation marks should also be considered. Modify the delimiter to properly delimit words for English. You will need to know Java regular expression in order to do that, but it is not hard to figure out.
    • (40 pts) Use symbol table to improve memory usage. Since most words will happen rather frequently, it does not make sense to store duplicate string objects in memory. An idea is to use a "symbol table" which assigns each word a unique integer. Only the integer is used everywhere else. The symbol table maps from the integer back to the word using a dictionary. You will have to modify your crawling, indexing, and searching code accordingly to use the symbol table.
    • (40 pts) Implement search for phrases. In particular, implement a GROUP operator that takes an integer n from the top of the stack, and group the next n words from the stack to form a phrase. For example, "unique new york 3 GROUP" will produce a search term for the phrase "unique new york" (be careful not to reverse the order of words). The phrase search first does the search like the AND operator, but must open and read the files in the search result to make sure the words appear in the file in the same order. Notice that you must not search for exact string matches because any number of white spaces (and punctuations if you used that as delimiter as well) should be ignored.

Further Updates

In this assignment, we do not provide homework template files like before. But in a few days, we will provide the text corpus for your search engine to crawl and index. Please check back for updates.

Update (July 23, 2007): a corpus (~3MB compressed, or ~12MB decompressed) is provided at http://cs-people.bu.edu/liulk/cs112/corpus.zip (password protected). Access information is sent on the class mailing list.