COS 226 Programming Assignment Checklist: WordNet

Frequently Asked Questions

Should SAP work if the digraph is not a DAG? Yes, the definition still applies in the presence of directed cycles.

Should I re-implement breadth-first search in my SAP class? No, it is easier and better design to reuse DirectedPathFinderBFS.java.

Can I use my own Digraph class? No, it must have the same API as our Digraph.java class; otherwise, you are changing the API to the SAP constructor (which takes a Digraph argument).

Is a vertex considered an ancestor of itself? Yes.

Can a noun appear in more than one synset? Absolutely. It will appear once for each meaning that the noun has. For example, here are all of the entries in synsets.txt that include the noun word:

37559,discussion give-and-take word,an exchange of views on some topic; "we had a good discussion"; "we had a word or two about it"
50266,news intelligence tidings word,new information about specific and timely events; "they awaited news of the outcome"
60429,parole word word_of_honor,a promise; "he gave his word"
60430,password watchword word parole countersign,a secret word or phrase known only to a restricted group; "he forgot the password"
80883,word,a unit of language that native speakers can identify; "words are the blocks from which sentences are made"; "he hardly said ten words all morning"
80884,word,a brief statement; "he didn't say a word about it"
80885,word,a verbal command for action; "when I give the word  charge!"
80886,word,a word is a string of bits stored in computer memory; "large computers use words up to 64 bits long"

Can I assume the id numbers will be integers in a small range? Yes, if there are V synsets, the ids will be numbered 1 through V (sorry, not the usual 0 through V-1). However, there is no guarantee that the id numbers appear consecutively in the input file.

Should my program work on datasets other than WordNet? Absolutely. It should work on any datasets in the appropriate format.

Some of the glosses have example sentences at the end. What is this? The example sentence is considered to be part of the gloss. You shouldn't need to do anything special to handle it.

Any advice on how to read in and parse the synset and hypernym data files? Use the readLine() method in our In library to read in the data one line at a time. Use the split() method in Java's String library to divide a line into fields. You can find an example using split() in Domain.java. Use Integer.parseInt() to convert string id numbers into integers.
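
As a rough illustration of the split() and Integer.parseInt() advice, here is a minimal sketch that takes apart one of the synsets.txt lines shown earlier. The class and method names (SynsetLineParser, parseId, parseNouns, parseGloss) are made up for this example and are not part of any required API. It works on a string literal rather than reading through our In library, so that it runs on its own.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the assignment API.
class SynsetLineParser {

    // Split into at most 3 fields (id, nouns, gloss). The limit keeps the
    // gloss in one piece if it should ever contain a comma.
    private static String[] fields(String line) {
        return line.split(",", 3);
    }

    // The first field is the synset id.
    static int parseId(String line) {
        return Integer.parseInt(fields(line)[0]);
    }

    // The second field holds the nouns, separated by single spaces.
    static String[] parseNouns(String line) {
        return fields(line)[1].split(" ");
    }

    // The third field is the gloss, including any example sentences.
    static String parseGloss(String line) {
        return fields(line)[2];
    }

    public static void main(String[] args) {
        String line = "60429,parole word word_of_honor,a promise; \"he gave his word\"";
        System.out.println(parseId(line));
        System.out.println(Arrays.toString(parseNouns(line)));
        System.out.println(parseGloss(line));
    }
}
```

In your own program you would call the same two library methods on each string returned by readLine().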

In WordNet, what should glosses() return if the noun is not in WordNet? Return an Iterable that has zero items.

How much memory should my WordNet program use on the wordnet data set? It's ok to allocate more space than the default that Java specifies on your computer, provided that the extra memory makes your program faster or more readable and that you are not using excessive memory (e.g., quadratic in the input size). Note that it is also possible to implement WordNet without needing more memory.

I'm an ontologist and I noticed that your hypernyms.txt file contains both is-a and is-instance-of relationships. Yes, you caught us. This ensures that every noun (except entity) has a hypernym. Here is an article on the subtle distinction.

What should I do if one of the nouns in Outcast is not in WordNet? We'll only give you nouns that are in WordNet. You can also assume that all of the synsets have a common ancestor (e.g., entity for synsets.txt).

Input, Output, and Testing

Input and output. We encourage you to create your own (possibly pathological) inputs to help test your program. If your datasets create problems for other programs (or ours!), we'll award extra credit. The input should be very small, and it should expose a potential flaw that other programs are likely to face. In your readme.txt, you should describe what the input is testing.

Some examples. Here are some interesting examples that you can use to test your code.

The following synset has several paths to the same ancestor.

municipality -> region:
municipality -> administrative_district -> district -> region
municipality -> populated_area -> geographic_area -> region

The following two synsets have different paths to common ancestors.

individual -> physical_entity, edible_fruit -> physical_entity:
individual -> object -> physical_entity
individual -> causal_agency -> physical_entity
edible_fruit -> garden_truck -> food -> solid -> matter -> physical_entity
edible_fruit -> reproductive_structure -> plant_organ -> plant_part ->
                natural_object -> unit -> object -> physical_entity

The following pairs of nouns are very far apart:

23 white_marlin, mileage
32 Black_Plague, black_marlin
32 American_water_spaniel, histology
32 Brown_Swiss, barrel_roll

The following synset has many ancestors and paths to "entity".
```
Ambrose Saint_Ambrose St._Ambrose
```

Possible progress steps

Create the data type SAP. First, think carefully about designing a correct and efficient algorithm for computing the shortest ancestral path. Consult a staff member if you're unsure. Design small DAGs to test and debug your code.
Download the directory wordnet. It contains the WordNet data files synsets.txt and hypernyms.txt described in the assignment.
Read in and parse the files synsets.txt and hypernyms.txt. Don't worry about storing the data in any data structures yet. Test that you are parsing the input correctly before proceeding.
Create a data type WordNet. Divide the constructor into two subtasks.
- Read in the synsets.txt file and build appropriate data structures.
- Read in the hypernyms.txt file and build a Digraph.
If you read in synsets.txt first, you can identify the largest id before constructing the Digraph. Check that it is 81,426, but do not hardwire this number into your program.
Add the methods isNoun() and glosses(). If you chose appropriate data structures when parsing synsets.txt, this step will be relatively easy.
Try some nouns that participate in many synsets, e.g., run, face and back.
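
One possible shape for the WordNet steps above is sketched below; it is an illustration under stated assumptions, not the required design. The class name WordNetIndexSketch, the constructor that takes a list of lines instead of file names, and the helper hypernymLists() are all hypothetical. The sketch also assumes that each line of hypernyms.txt has the form "id,hypernym id,hypernym id,...", which this checklist does not spell out, and it uses plain adjacency lists as a stand-in for our Digraph so that it runs without our libraries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the WordNet data type; names are illustrative only.
class WordNetIndexSketch {
    // Maps each noun to the glosses of every synset that contains it.
    private final Map<String, List<String>> glossesByNoun = new HashMap<>();
    private int largestId = -1;

    // Takes already-read lines rather than file names, to stay self-contained.
    WordNetIndexSketch(List<String> synsetLines) {
        for (String line : synsetLines) {
            String[] fields = line.split(",", 3);   // id, nouns, gloss
            largestId = Math.max(largestId, Integer.parseInt(fields[0]));
            for (String noun : fields[1].split(" ")) {
                glossesByNoun.computeIfAbsent(noun, k -> new ArrayList<>()).add(fields[2]);
            }
        }
    }

    boolean isNoun(String word) {
        return glossesByNoun.containsKey(word);
    }

    // Returns an Iterable with zero items when the noun is not present.
    Iterable<String> glosses(String word) {
        List<String> found = glossesByNoun.get(word);
        if (found == null) return Collections.emptyList();
        return Collections.unmodifiableList(found);
    }

    // Computed from the input, never hardwired.
    int largestId() {
        return largestId;
    }

    // ASSUMPTION: each hypernym line is "id,hypernymId,hypernymId,...".
    // Ids start at 1, so index 0 of the returned list is unused.
    static List<List<Integer>> hypernymLists(List<String> hypernymLines, int largestId) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i <= largestId; i++) adj.add(new ArrayList<>());
        for (String line : hypernymLines) {
            String[] ids = line.split(",");
            int from = Integer.parseInt(ids[0]);
            for (int i = 1; i < ids.length; i++) {
                adj.get(from).add(Integer.parseInt(ids[i]));
            }
        }
        return adj;
    }

    public static void main(String[] args) {
        WordNetIndexSketch wn = new WordNetIndexSketch(Arrays.asList(
            "60429,parole word word_of_honor,a promise; \"he gave his word\"",
            "80884,word,a brief statement; \"he didn't say a word about it\""));
        System.out.println(wn.isNoun("word"));
        for (String gloss : wn.glosses("word")) System.out.println(gloss);
        System.out.println(wn.largestId());
    }
}
```

Keeping one list of glosses per noun means isNoun() and glosses() are each a single map lookup, and a noun such as word that appears in several synsets simply accumulates several glosses.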

Optional Optimizations

There are a few things you can do to speed up a sequence of SAP computations on the same digraph. Do not attempt any of these unless you have thoroughly tested your code. Be sure that your optimized solution is as modular as the code it replaces and works just as well. In other words, test thoroughly.

The bottleneck operation is re-initializing arrays of length V to perform the computation. This should be done once for the first SAP computation, but for subsequent ones, you can remember which array entries changed in the previous computation, and only re-initialize those entries. Since only a small number of entries will change, this can lead to dramatic savings.
To compute the distance between two nouns A and B, you compute the SAP between all pairs of synsets of A and synsets of B, and return the length of the shortest one. Naively, this involves a*b SAP calculations if A appears in a synsets and B appears in b synsets. However, it's possible to compute the overall shortest SAP with one or two graph searches (as opposed to a*b). To accomplish this, you may add the following methods to the API for SAP.
```
// return length of shortest ancestral path between any vertex in v[]
// and any vertex in w[]; -1 if no such path
public int length(int[] v, int[] w)

// return a common ancestor of some vertex in v[] and some vertex in w[]
// that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int[] v, int[] w)
```
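
The two optimizations above can be combined, roughly as in the sketch below. Treat it as one possible approach under stated assumptions rather than the reference solution. The class name MultiSourceSapSketch is hypothetical, and an int[][] adjacency array stands in for our Digraph so that the example runs on its own; in your SAP class you would keep the Digraph passed to the constructor. The sketch runs two multi-source breadth-first searches per query and records every array entry it changes, so the next query resets only those entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; adj[x] lists the vertices that x points to.
class MultiSourceSapSketch {
    private final int[][] adj;
    private final int[] distV;   // distance from the nearest vertex in v[], -1 if unreached
    private final int[] distW;   // distance from the nearest vertex in w[], -1 if unreached
    private final List<Integer> touched = new ArrayList<>();  // entries changed by the last query

    MultiSourceSapSketch(int[][] adj) {
        this.adj = adj;
        distV = new int[adj.length];
        distW = new int[adj.length];
        Arrays.fill(distV, -1);   // the only full initialization of length V
        Arrays.fill(distW, -1);
    }

    // Length of a shortest ancestral path between v[] and w[]; -1 if none.
    public int length(int[] v, int[] w) {
        return solve(v, w)[0];
    }

    // A common ancestor on a shortest ancestral path; -1 if none.
    public int ancestor(int[] v, int[] w) {
        return solve(v, w)[1];
    }

    private int[] solve(int[] v, int[] w) {
        // Undo only the entries that the previous query changed.
        for (int x : touched) {
            distV[x] = -1;
            distW[x] = -1;
        }
        touched.clear();

        bfs(v, distV);
        bfs(w, distW);

        // Any common ancestor was reached by both searches, so it is in touched.
        int best = -1;
        int bestAncestor = -1;
        for (int x : touched) {
            if (distV[x] >= 0 && distW[x] >= 0) {
                int total = distV[x] + distW[x];
                if (best < 0 || total < best) {
                    best = total;
                    bestAncestor = x;
                }
            }
        }
        return new int[] { best, bestAncestor };
    }

    // Breadth-first search started from all sources at once.
    private void bfs(int[] sources, int[] dist) {
        Queue<Integer> queue = new ArrayDeque<>();
        for (int s : sources) {
            if (dist[s] < 0) {
                dist[s] = 0;
                touched.add(s);
                queue.add(s);
            }
        }
        while (!queue.isEmpty()) {
            int x = queue.remove();
            for (int y : adj[x]) {
                if (dist[y] < 0) {   // also guards against directed cycles
                    dist[y] = dist[x] + 1;
                    touched.add(y);
                    queue.add(y);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Small test DAG: 1->0, 2->0, 3->1, 4->1, 5->2; vertex 6 is isolated.
        int[][] adj = { {}, {0}, {0}, {1}, {1}, {2}, {} };
        MultiSourceSapSketch sap = new MultiSourceSapSketch(adj);
        System.out.println(sap.length(new int[] {3}, new int[] {4}));
        System.out.println(sap.ancestor(new int[] {3}, new int[] {4}));
        System.out.println(sap.length(new int[] {3, 2}, new int[] {5}));
        System.out.println(sap.length(new int[] {6}, new int[] {3}));
    }
}
```

Note that calling length() and then ancestor() with the same arguments repeats the searches in this sketch; caching the most recent result is a further refinement you could consider.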