Post date: Jul 20, 2009 6:19:45 PM
Deadlines:
All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.
You will be given 1.5 weeks (with a bonus) or 2 weeks (without bonus) to finish this assignment. You may pair up with another student (forming a team of two), but you must form the group and notify Ilir (cc' Likai) of the team by July 23, (Thursday).
In this homework, you will implement a search engine for plain text files stored on the same computer the search engine runs. The search engine will rely on concepts such as dictionary (hash table) and sets (binary search tree) to maintain the index used for searches and to manipulate search results.
Your search engine program will start by asking the user to locate the directory to be indexed. You can either use the JFileChooser dialog (as in NanameVerifierMain.java in homework 1 and NanameSolverMain.java in homework 2) or prompt the user for a directory name at the console. Once the user picks a directory, your crawler will recursively index all files under the directory. The JFileChooser dialog will give you a File object, which actually represents a path name, not an opened file.
The File object can be used to enumerate all the files in a directory as follows: you can tell whether the file object represents a regular file or a directory. If the file is a regular file, then you index it. If the file is a directory, then you can call listFiles() method to get a list of files it contains. Then for each file returned, you either index it or traverse the directory recursively.
To open a file for indexing, you will use the Scanner class. Essentially, each string you read using the next() method will be a word that you index. For each file indexed, you will maintain a histogram dictionary that maps from all the words in the file to the number of their occurrences. Convert all the words to lowercase for case-insensitive indexing. The default delimiter for word is white space, but you can add punctuation to make it more appropriate for the English language (this is one of the optional features you can implement).
After you finish reading a file and populating the histogram for word occurrence, you will update a master word index. The master word index is a dictionary that maps from a word to a set of files that contains the word. Notice that a word can happen in multiple files, so the "value" part of the "key-value" dictionary mapping must be a set.
To sum up,
After crawling and indexing is done, you will prompt the user for search terms. The search syntax, for simplicity, is borrowed from Forth (see ForthInterpreterMain.java in homework 3).
Search query syntax:
As such, a search term is not just a string, but it is an instruction for manipulating search results from the master index. For example, the query string:
url uri OR rfc AND ?
searches the master word index for the set of files that contain the word "url" (let's refer to this as the url-set), "uri" (referred to as the uri-set), and "rfc" (referred to as the rfc-set). It will take the union of the url-set and uri-set, and intersect the result with the rfc-set. The resulting set will contain all files that contain either the words "url" or "uri" but must contain the word "rfc".
In other words, the OR operator corresponds to the union operation on two sets of search results, and AND operator corresponds to the intersection operation.
Once you get the desired search result set, just print out the file names of all files in the set.
A search engine does not stop here. There are many ways for improvement, so this project is separated into required and optional components. The total of this homework is 200 points, but you may implement more for extra credit. In your homework submission, you must clearly indicate the options taken.
Required
Optional
In this assignment, we do not provide homework template files like before. But in a few days, we will provide the text corpus for your search engine to crawl and index. Please check back for updates.
Update (July 23, 2007): a corpus (~3MB compressed, or ~12MB decompressed) is provided at http://cs-people.bu.edu/liulk/cs112/corpus.zip (password protected). Access information is sent on the class mailing list.