Teaching‎ > ‎

CS112 Summer II 2009 TF

This is the teaching fellow's page. I will post some supplemental material here from time to time, allowing both students and the instructor to keep track of what happened in discussion sessions.

General Information

You're welcome to make appointment by e-mail; please check my schedule first.

Working With CS Computing Resources

See the Getting Started guide for tips on working from home and transferring files over, and for a primer on using Linux.

Homework plans

Unless otherwise extended, all homework assignments are due at 11:59pm on the due date. There are two deadlines: one that carriers a premium bonus of 10% is typically on Friday, and the regular deadline is on Monday.

  • Homework 1: out 6/29, due 7/3 (+10%) or 7/6.
  • Homework 2: out 7/6, due 7/10 (+10%) or 7/13.
  • Homework 3: out 7/13, due 7/17 (+10%) or 7/20.
  • Homework 4: out 7/20, due 7/30 (+10%) or 8/3.

Note that midterm is 7/20, due date of Homework 3; and final exam is 8/6, three days after the due date of the last homework.

Homework submission

For submitting homework, gsubmit instructions can be found here. The instruction assumes you worked in the computer lab in the CS cluster. If you worked from another computer, please see Getting Started for different ways to copy your files to the CS cluster for submission.

Final Project

posted Jul 20, 2009, 11:19 AM by Site Admin Role   [ updated Jul 23, 2009, 2:47 PM ]

Administrative

Deadlines:
  • With a 10% bonus: Thursday, July 30, 2009, at 11:59pm.
  • Last deadline: Monday, August 3, 2009, at 11:59pm.
  • No late submission will be accepted.

All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.

You will be given 1.5 weeks (with a bonus) or 2 weeks (without bonus) to finish this assignment. You may pair up with another student (forming a team of two), but you must form the group and notify Ilir (cc' Likai) of the team by July 23, (Thursday).

Description

In this homework, you will implement a search engine for plain text files stored on the same computer the search engine runs. The search engine will rely on concepts such as dictionary (hash table) and sets (binary search tree) to maintain the index used for searches and to manipulate search results.

Crawl and Index Phase

Your search engine program will start by asking the user to locate the directory to be indexed. You can either use the JFileChooser dialog (as in NanameVerifierMain.java in homework 1 and NanameSolverMain.java in homework 2) or prompt the user for a directory name at the console. Once the user picks a directory, your crawler will recursively index all files under the directory. The JFileChooser dialog will give you a File object, which actually represents a path name, not an opened file.

The File object can be used to enumerate all the files in a directory as follows: you can tell whether the file object represents a regular file or a directory. If the file is a regular file, then you index it. If the file is a directory, then you can call listFiles() method to get a list of files it contains. Then for each file returned, you either index it or traverse the directory recursively.

To open a file for indexing, you will use the Scanner class. Essentially, each string you read using the next() method will be a word that you index. For each file indexed, you will maintain a histogram dictionary that maps from all the words in the file to the number of their occurrences. Convert all the words to lowercase for case-insensitive indexing. The default delimiter for word is white space, but you can add punctuation to make it more appropriate for the English language (this is one of the optional features you can implement).

After you finish reading a file and populating the histogram for word occurrence, you will update a master word index. The master word index is a dictionary that maps from a word to a set of files that contains the word. Notice that a word can happen in multiple files, so the "value" part of the "key-value" dictionary mapping must be a set.

To sum up,

  • An indexed file contains mappings from String (a word) to Integer (the number of occurrences).
  • The master index contains mappings from String (a word) to a set of indexed files containing that word.

Query Phase

After crawling and indexing is done, you will prompt the user for search terms. The search syntax, for simplicity, is borrowed from Forth (see ForthInterpreterMain.java in homework 3).

Search query syntax:

  • A string that is not recognized as any of the operators below is a search term.
  • Operator ?: conducts a search for the search term on top of the operand stack.
  • Operator OR: forms a new search term by OR'ing the top two search terms on the operand stack.
  • Operator AND: forms a new search term by AND'ing the top two search terms on the operand stack.

As such, a search term is not just a string, but it is an instruction for manipulating search results from the master index. For example, the query string:

url uri OR rfc AND ?

searches the master word index for the set of files that contain the word "url" (let's refer to this as the url-set), "uri" (referred to as the uri-set), and "rfc" (referred to as the rfc-set). It will take the union of the url-set and uri-set, and intersect the result with the rfc-set. The resulting set will contain all files that contain either the words "url" or "uri" but must contain the word "rfc".

In other words, the OR operator corresponds to the union operation on two sets of search results, and AND operator corresponds to the intersection operation.

Once you get the desired search result set, just print out the file names of all files in the set.

Grading Criteria

A search engine does not stop here. There are many ways for improvement, so this project is separated into required and optional components. The total of this homework is 200 points, but you may implement more for extra credit. In your homework submission, you must clearly indicate the options taken.

Required

  • (60 pts) Crawling and indexing.
  • (60 pts) Execution of search queries.

Optional

  • (40 pts) Implementation of all data structure. You will reuse code from previous homework assignments as well as implement the hash table for dictionary. This could be a chance for you to fix code that did not work before. If this option is not taken, you may use Java data structure API, e.g. HashMap and anything you find applicable in the java.util package.
  • (20 pts/40 pts) Relevance metric. Since you keep a list of words and their number of occurrences for each file, devise a relevance metric based on the number of occurrences, and sort the search result according to relevance. You get 20 points for this item if you use Java's sorting routine, and 40 points if you implement your own merge sort or quick sort.
  • (20 pts) Improve word delimiter for the English language. The Scanner uses white space for word delimiter by default, but in English, punctuation marks should also be considered. Modify the delimiter to properly delimit words for English. You will need to know Java regular expression in order to do that, but it is not hard to figure out.
  • (40 pts) Use symbol table to improve memory usage. Since most words will happen rather frequently, it does not make sense to store duplicate string objects in memory. An idea is to use a "symbol table" which assigns each word a unique integer. Only the integer is used everywhere else. The symbol table maps from the integer back to the word using a dictionary. You will have to modify your crawling, indexing, and searching code accordingly to use the symbol table.
  • (40 pts) Implement search for phrases. In particular, implement a GROUP operator that takes an integer n from the top of the stack, and group the next n words from the stack to form a phrase. For example, "unique new york 3 GROUP" will produce a search term for the phrase "unique new york" (be careful not to reverse the order of words). The phrase search first does the search like the AND operator, but must open and read the files in the search result to make sure the words appear in the file in the same order. Notice that you must not search for exact string matches because any number of white spaces (and punctuations if you used that as delimiter as well) should be ignored.

Further Updates

In this assignment, we do not provide homework template files like before. But in a few days, we will provide the text corpus for your search engine to crawl and index. Please check back for updates.

Update (July 23, 2007): a corpus (~3MB compressed, or ~12MB decompressed) is provided at http://cs-people.bu.edu/liulk/cs112/corpus.zip (password protected). Access information is sent on the class mailing list.

Homework 3

posted Jul 13, 2009, 1:20 PM by Site Admin Role

Administrative

Deadlines:
  • With a 10% bonus: Friday, July 17, 2009, at 11:59pm.
  • Last deadline: Monday, July 20, 2009, at 11:59pm.
  • No late submission will be accepted.

All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.

Description

Here is a quick rundown of a list of functions you need to implement or functions of interest. The provided functions may help you with the functions you need to implement, so you should study them.

  • DoublyLinkedList.java
    • (provided) public void Prepend(T item)
    • (provided) public void Append(T item)
    • (10 pts) public T RemoveFirst()
    • (10 pts) public T RemoveLast()
  • Queue.java
    • (20 pts) public class Queue<T>
  • QueueTest.java
    • (10 pts) public void testFIFO()
  • Stack.java
    • (20 pts) public class Stack<T>
  • StackTest.java
    • (10 pts) public void testLIFO()
  • ForthInterpreterMain.java
    • (30 pts) public class ForthInterpreterMain
  • BinarySearchTreeSet.java
    • (20 pts) public Iterator<T> iterator()
    • (provided) public boolean ContainsIter(T item)
    • (provided) public boolean ContainsRecurs(Node<T> curr, T item)
    • (10 pts) public void Add(T item)
    • (20 pts) public T Remove(T item)
    • (10 pts) public void Union(Iterable<T> xs)
    • (10 pts) public void Difference(Iterable<T> xs)
    • (20 pts) public void Intersect(Membership<T> xs)

For a total of 200 points.

If you cannot finish this homework, try to concentrate on what you do know how to do. Try not to work on many incomplete functions for partial credit.

Additional Instructions for Forth Interpreter

Forth is a programming language for manipulating and performing computation on stack. In this assignment, you'll implement a simplified Forth interpreter.

Basic Syntax

A Forth program is read from standard input (Eclipse console) as a sequence of tokens separated by whitespace (space, tab, new line, etc). Each token can either be an integer or the name of an operator. If the token is an integer, it is pushed on the stack. If the token is an operator, then some computation is performed. In particular, there is a special operator "." (just the period) which pops an item from the stack and prints it to the standard output.

Since the program operands are in stack order, "0 1 . ." will first print "1" then "0".

Operators

You will need to implement the following operators.

  • "." (provided): pop the stack and print it to standard output.
  • "dup": duplicate once what's on the top of the stack, e.g. "1 dup . ." prints "1" twice.
  • "drop": discard one item from the top of the stack, e.g. "0 1 drop ." prints "0".
  • "swap": exchange the top two items on the stack, e.g. "0 1 swap . ." prints "0" then "1".
  • "recall": takes one argument n that is >= 0; replaces the argument with a copy of the nth value down the stack. E.g. "5 4 3 2 1 0 3 recall ." prints "3" with the numbers "0" "1" "2" "3" "4" "5" still left on the stack, from top to bottom.
  • Binary arithmetic operators "+" "-" "*" "/" "mod" which pops two operands off the stack then pushes the result back on the stack, e.g. "1 2 + ." prints "3".
  • "negate": performs unary negation on the operand on top of the stack, e.g. "1 negate ." prints "-1"
  • "abs": computes the absolute value, e.g. "-1 abs ." prints "1"
  • "min" and "max": pops two operands off the stack and then pushes either one back, depending on which is smaller (for the "min" operator") or larger (for the "max" operator).

Running the program

ForthInterpreterMain.java hooks the standard input to a scanner, which converts a sequence of characters to a sequence of input tokens. The code for doing that is provided for you, as well as the ability to push integer tokens on the stack and the implementation of the "." operator. Before you can run this program, you need to implement Stack.java. Once this is done, you should be able to enter some numbers and use "." to see that integer tokens are printed in LIFO order.

Once you implement more operators, you can test them similarly, using "." to observe the output.

Example

If you enter "3 dup * ." (push 3 to the stack, duplicate 3, multiply 3 by 3, and print the result) you'd see "9" printed as the result. If you enter "3 dup * 4 dup * + ." ("3 dup *" becomes 9, which is saved to the stack; "4 dup *" becomes 16, which is saved to the stack; add 9 and 16, and print the result), you'd see "25" printed as the result.

Homework 2

posted Jul 2, 2009, 11:37 AM by Site Admin Role   [ updated Jul 6, 2009, 3:34 PM ]

Administrative

Deadlines:
  • With a 10% bonus: Friday, July 10, 2009, at 11:59pm.
  • Last deadline: Monday, July 13, 2009, at 11:59pm.
  • No late submission will be accepted.

All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.

Description

By now, you should be familiar with basic Eclipse features and running unit tests. In this homework, there are two parts. In the first part, you will manipulate singly and doubly linked lists. In the second part, you will write the Naname solver (based on your code in the previous assignment) and write some unit tests for it. The solver will involve writing recursive functions. Again, use the provided code hw2.zip, which is attached below.

Here is a quick rundown of a list of functions you need to implement or functions of interest. The provided functions may help you with the functions you need to implement, so you should study them.

  • SinglyLinkedList.java
    • (provided) public int Length()
    • (10 pts) public void Append(T item)
    • (provided) public T Nth(int n)
    • (10 pts) public void Insert(T item, int n)
    • (20 pts) public void Reverse()
  • SinglyLinkedListTest.java
    • (provided) public void testAppend()
    • (10 pts) public void testNth()
  • DoublyLinkedList.java
    • (30 pts) public void Reverse()
  • DoublyLinkedListTest.java
    • (10 pts) public void AssertDoubleLinks() (explain what it does)
  • Naname.java
    • (10 pts) modify your Naname class from hw1 to make it implement the Puzzle interface.
  • PuzzleSolver.java
    • (provided) interface Puzzle
    • (50 pts) public boolean Solve()
  • PuzzleSolverTest.java
    • (provided) public void testSolveCheckerBoard()
    • (10 pts) public void testSolveAltRows()
    • (10 pts) public void testSolveAltColumns()
    • (10 pts) public void testSolveAltRowsColumns()
  • Sudoku.java
    • (20 pts extra credit) If you want to see that your puzzle solver also works with Sudoku, create a Sudoku class that implements the Puzzle interface, then modify NanameSolverMain to solve for Sudoku puzzles using the same PuzzleSolver you wrote.

For a total of 180 points + 20 extra credit.

If you cannot finish this homework, try to concentrate on what you do know how to do. Try not to work on many incomplete functions for partial credit.

Additional Instructions for the Puzzle Solver

In the last homework, you implemented the Naname class. In particular, there is one method Tally() that gives you the occurrence of each number values 0 (UNDECIDED) and 1 through 9, covered by the "uniqueness domain." For Naname, the uniqueness domain is the diagonal, anti-diagonal, and the sub-block that contains the specific row-column position. For Sudoku, it is the row, column, and the sub-block. You notice that the rules of Naname and Sudoku are similar enough that you can write one solver for both! The solver would use Tally() and only try the number values 1 through 9 that has zero occurrence in the "uniqueness domain." You don't need to try anything else.

In order to write one solver for both Naname and Sudoku, suppose you worked out the "common" interface between them that describes what these puzzle games must implement in common, and you called the interface Puzzle. The Puzzle interface can be found in the PuzzleSolver.java file; please read over it.

Your first step is to copy Naname.java from hw1 and modify it to implement this interface. In order to do that, you need to modify the class line to read:

public class Naname implements Puzzle
{
  ...
}

And then try to work out the discrepencies between what your class provides and what the interface expects. This should be pretty easy.

After you complete this step, the red X next to NanameSolverMain.java and PuzzleSolverTest.java should disappear, as they can now reference the class.

Once you finish PuzzleSolver, select NanameSolverMain.java and click on the Run button. A file dialog will pop up and allow you to choose a text file that contains a partial puzzle. Some solvable and non-solvable puzzles are provided in the examples folder. For each solution found, the completed puzzle is printed to console, and at the end a message box will pop up indicating whether the puzzle can be solved.

Solving some puzzles can take a very long time, but the examples provided should finish running in a minute or two.

Homework 1

posted Jun 29, 2009, 7:23 PM by Site Admin Role   [ updated Jun 30, 2009, 9:29 AM ]

Administrative

Deadlines:
  • With a 10% bonus: Friday, July 3, 2009, at 11:59pm.
  • Last deadline: Monday, July 6, 2009, at 11:59pm.
  • No late submission will be accepted.

All students must adhere to Academic Code of Conduct. You are welcome to brainstorm ideas with your fellow students, but you are not allowed to share code.

Description

The first homework aims to familiarize you with Eclipse and programming in Java. All you need to know is array and simple control statements such as for-loops and if-statements. Please go over the instructions in Working with Eclipse first. The homework problems are included in the provided hw1.zip, attached below. Here is a quick run-down of all the functions you need to implement.

  • SimpleProblems.java:
    • (10 pts) public static long NumberOfArrangements(int n)
    • (10 pts) public static boolean UPCVerify(int[] upc)
    • (10 pts) public static long SumPerfectSquares(int k)
    • (20 pts) public static int[] FibonacciArray(int n)
  • Naname.java:
    • (20 pts) public int[] Tally(int row, int col)
    • (20 pts) public boolean Verify()
    • (30 pts) public static Naname GenerateWithSeed(int[] seed)

For a total of 120 points.

While the problems in SimpleProblems.java are fairly self-explanatory, the Naname game deserves some explanation.

Naname

Naname (斜め) is a Sudoku-like puzzle game played on a 9x9 grid. Each cell on the grid is filled in with a number from 1 to 9. Like Sudoku, the numbers within each 3x3 sub-squares must be unique. However, unlike Sudoku, Naname requires that each diagonal and anti-diagonal contain unique numbers, like illustrated below.

The way the figure is color-coded, each color needs to have exactly one occurrence of 1 through 9. As you can see, the numbers filled in are actually a Naname solution. This file is included under examples/solution1.txt; another example is given as examples/solution2.txt. Several non-solutions are also provided.

Running the Verifier

A class called NanameVerifierMain is provided. Select the class and click on the Run button. You will be presented with a File Chooser dialog where you can choose the file that may or may not contain a solution. Once you click Okay, it will create a Naname object and then call your Naname.Verify() method. Based on the result you return, it will show a message box telling you whether the file contains a solution or not.

The format of the file is very simple. It just contains 9 lines, each contains 9 numbers. Non-digits are ignored, as well as lines that contain fewer than 9 digits, so you can write comments in the file without special syntax. Some example solutions as well as non-solutions are provided in the examples folder under the project.

1-4 of 4