COS 226 Programming Assignment Checklist: WordNet

Frequently Asked Questions

Can I read the synset or hypernym file twice? No, file I/O is very expensive so please read each file only once and store it in an appropriate data structure.

Any advice on how to read in and parse the synset and hypernym data files? Use the readLine() method in our In library to read in the data one line at a time. Use the split() method in Java's String library to divide a line into fields. You can find an example using split() in Domain.java. Use Integer.parseInt() to convert string id numbers into integers.
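
As a minimal sketch of the split() and Integer.parseInt() advice above: the class name SynsetLine is hypothetical, and the In library is left out so the example stands alone. It assumes each synset line has the form id,nouns,gloss, as in the sample entries in this checklist, and it passes a limit of 3 to split() so that any commas inside the gloss stay in the last field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the assignment API.
class SynsetLine {
    final int id;
    final List<String> nouns;

    SynsetLine(String line) {
        // A limit of 3 keeps any commas inside the gloss in the last field.
        String[] fields = line.split(",", 3);
        // Convert the string id into an integer.
        id = Integer.parseInt(fields[0]);
        // Nouns within one synset are separated by single spaces.
        nouns = Arrays.asList(fields[1].split(" "));
    }

    public static void main(String[] args) {
        SynsetLine s = new SynsetLine(
            "59267,parole word word_of_honor,a promise; \"he gave his word\"");
        System.out.println(s.id + " " + s.nouns);
    }
}
```

In your own code you would feed it the lines returned by readLine() rather than a hardwired string.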

It takes a very long time to read in the input files in DrJava. What should I do? Use the command line. DrJava incurs substantial overhead with input and output.

Which data structure(s) should I use to store the synsets, synset ids, and hypernyms? This part of the assignment is up to you. You must carefully select data structures to achieve the specified performance requirements.

Do I need to store the glosses? No, you won't use them on this assignment.

Can I use my own Digraph class? No, it must have the same API as our Digraph.java class; otherwise, you are changing the API to the ShortestCommonAncestor constructor (which takes a Digraph argument). Do not submit Digraph.java.

How can I make the data type ShortestCommonAncestor immutable? You can (and should) save the associated digraph in an instance variable. However, because our Digraph data type is mutable, you must first make a defensive copy by calling the copy constructor.
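
The defensive-copy idea can be sketched as follows. MiniDigraph is a stand-in written only so the example compiles on its own (an assumption, not our Digraph class), and ImmutableWrapper is a hypothetical name for the client data type. In your code you would call the copy constructor of Digraph instead.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a mutable digraph (an assumption, so the sketch compiles alone).
class MiniDigraph {
    private final List<List<Integer>> adj = new ArrayList<>();

    MiniDigraph(int V) {
        for (int v = 0; v < V; v++) adj.add(new ArrayList<>());
    }

    // Copy constructor: copies every adjacency list, not just the references.
    MiniDigraph(MiniDigraph that) {
        for (List<Integer> list : that.adj) adj.add(new ArrayList<>(list));
    }

    void addEdge(int v, int w) { adj.get(v).add(w); }

    int outdegree(int v) { return adj.get(v).size(); }
}

// Hypothetical client type that stays immutable despite the mutable argument.
class ImmutableWrapper {
    private final MiniDigraph G;

    ImmutableWrapper(MiniDigraph G) {
        // Defensive copy: later changes to the caller's graph cannot leak in.
        this.G = new MiniDigraph(G);
    }

    int outdegree(int v) { return G.outdegree(v); }
}
```

Without the copy, a client could add an edge to the original digraph after construction and silently change the results of later queries.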

Should I reimplement breadth-first search in my ShortestCommonAncestor class? In the beginning, you should call the relevant methods in BreadthFirstDirectedPaths.java. However, to implement the additional performance requirement, you will need to implement your own version, perhaps in a helper class named DeluxeBFS.java.

For the "additional performance requirements," do length() and ancestor() need to take time proportional to the number of vertices and edges reachable from the argument vertices in the worst case? Or can I use hashing? You can make standard technical assumptions (such as the uniform hashing assumption). If you do so, state any assumptions that you make in your readme.txt file.

I understand how to compute the length(int v, int w) method in time proportional to E + V in the worst case but my length(Iterable<Integer> v, Iterable<Integer> w) method takes time proportional to a × b × (E + V), where a and b are the sizes of the two iterables. How can I improve it to be proportional to E + V? The key is to use a multi-source version of breadth-first search, as in the constructor in BreadthFirstDirectedPaths that accepts an iterable of sources as an argument (instead of a single source).
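
Here is a rough sketch of the multi-source idea, under a few assumptions: the class MultiSourceBFS and its method names are hypothetical, and plain adjacency lists stand in for our Digraph so that the example is self-contained. All sources enter the queue at distance 0, so a single search per subset suffices.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; adjacency lists stand in for the course's Digraph.
class MultiSourceBFS {
    // Returns dist[v] = fewest edges from any source to v, or -1 if unreachable.
    static int[] distances(List<List<Integer>> adj, Iterable<Integer> sources) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        Queue<Integer> queue = new ArrayDeque<>();
        // Seed the queue with every source at distance 0.
        for (int s : sources) {
            if (dist[s] == -1) {
                dist[s] = 0;
                queue.add(s);
            }
        }
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adj.get(v)) {
                if (dist[w] == -1) {
                    dist[w] = dist[v] + 1;
                    queue.add(w);
                }
            }
        }
        return dist;
    }

    // Length of a shortest ancestral path between the two subsets, or -1 if none.
    static int length(List<List<Integer>> adj, Iterable<Integer> a, Iterable<Integer> b) {
        int[] da = distances(adj, a);
        int[] db = distances(adj, b);
        int best = -1;
        for (int v = 0; v < adj.size(); v++) {
            if (da[v] != -1 && db[v] != -1) {
                int sum = da[v] + db[v];
                if (best == -1 || sum < best) best = sum;
            }
        }
        return best;
    }
}
```

This version runs in time proportional to E + V regardless of a and b. It initializes an array of size V on every call, so it does not by itself meet the additional performance requirements.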

Is a vertex considered an ancestor of itself? Yes.

What is the root synset for the WordNet DAG?

38003,entity,that which is perceived or known or inferred to have its own distinct existence (living or nonliving)

Can a noun appear in more than one synset? Absolutely. It will appear once for each meaning that the noun has. For example, here are all of the entries in synsets.txt that include the noun word.

35532,discussion give-and-take word,an exchange of views on some topic; "we had a good discussion"; "we had a word or two about it"
56587,news intelligence tidings word,new information about specific and timely events; "they awaited news of the outcome"
59267,parole word word_of_honor,a promise; "he gave his word"
59465,password watchword word parole countersign,a secret word or phrase known only to a restricted group; "he forgot the password"
81575,word,a string of bits stored in computer memory; "large computers use words up to 64 bits long"
81576,word,a verbal command for action; "when I give the word, charge!"
81577,word,a brief statement; "he didn't say a word about it"
81578,word,a unit of language that native speakers can identify; "words are the blocks from which sentences are made"; "he hardly said ten words all morning"
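
Because one noun can belong to several synsets, a lookup from a noun to all of its synset ids is one natural fit. The sketch below is only one option among several, and the class NounIndex and its methods are hypothetical names; the choice of data structures for your submission remains up to you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical index from a noun to the ids of every synset containing it.
class NounIndex {
    private final Map<String, Set<Integer>> idsByNoun = new HashMap<>();

    // Records that each noun in the space-separated synset belongs to id.
    void add(int id, String synset) {
        for (String noun : synset.split(" ")) {
            idsByNoun.computeIfAbsent(noun, k -> new TreeSet<>()).add(id);
        }
    }

    // Returns the ids of all synsets containing the noun (empty if none).
    Set<Integer> idsOf(String noun) {
        return idsByNoun.getOrDefault(noun, new TreeSet<>());
    }
}
```

After adding the entries for ids 59267 and 81575 shown above, idsOf("word") would contain both ids, while idsOf("parole") would contain only 59267.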

Can a synset consist of exactly one noun? Yes. Moreover, there can be several different synsets that consist of the same noun. See the President example below.

I'm an ontologist and I noticed that your hypernyms.txt file contains both is-a and is-instance-of relationships. Yes, you caught us. This ensures that every noun (except entity) has a hypernym. Here is an article on the subtle distinction.

What should sca(), ancestor(), or outcast() return if there is a tie for the shortest common ancestor or outcast? The API does not specify, so you are free to return any such ancestor or outcast.

To meet the "performance requirements" for sca() and distance(), can I make a call to the constructor of SCA? No, you cannot call the constructor; you may call just one method of SCA.

Do I need to throw exceptions explicitly with a throw statement? No, it's fine if they are thrown implicitly, e.g., you can rely on any method in Digraph.java to throw a java.lang.IndexOutOfBoundsException if passed a vertex argument outside of the prescribed range. A good API documents the requisite behavior for all possible arguments, but you should not need much extra code to deal with these corner cases.

Input, Output, and Testing

Some examples. Here are some interesting examples that you can use to test your code.

You will need to use an Iterable when two or more synsets contain the same noun. For example, the noun President appears in two synsets:

13745,President_of_the_United_States President Chief_Executive,the office of the United States ....
13746,President_of_the_United_States United_States_President President Chief_Executive,the person who holds the office .....

The synset municipality has two paths to region.

municipality -> administrative_district -> district -> region
municipality -> populated_area -> geographic_area -> region

The synsets individual and edible_fruit have several different paths to their common ancestor physical_entity.

individual -> organism being -> living_thing animate_thing -> whole unit -> object physical_object -> physical_entity
person individual someone somebody mortal soul -> causal_agent cause causal_agency -> physical_entity
edible_fruit -> garden_truck -> food solid_food -> solid -> matter -> physical_entity
edible_fruit -> fruit -> reproductive_structure -> plant_organ -> plant_part -> natural_object -> unit -> object -> physical_entity

The following pairs of nouns are very far apart:

(distance = 23) white_marlin, mileage
(distance = 33) Black_Plague, black_marlin
(distance = 27) American_water_spaniel, histology
(distance = 29) Brown_Swiss, barrel_roll

The following synset has many paths to entity.
```
Ambrose Saint_Ambrose St._Ambrose
```
Also, we encourage you to use the small collection of sample files in the ftp directory.

Possible progress steps

Download the directory wordnet. It contains input files for ShortestCommonAncestor, WordNet, and Outcast.
Create the data type ShortestCommonAncestor. First, think carefully about designing a correct and efficient algorithm for computing the shortest common ancestor. Consult a staff member if you're unsure. In addition to the digraph*.txt files, design small rooted DAGs to test and debug your code. Modularize by sharing common code. Hint: do not attempt the additional performance requirements until you have working code using BreadthFirstDirectedPaths, because meeting them will involve reimplementing breadth-first search.
Add code to ShortestCommonAncestor to detect whether a digraph is a rooted DAG. As defined in the assignment, a digraph is a rooted DAG if it is acyclic and has one vertex—the root—that is an ancestor of every other vertex.
Read in and parse the files described in the assignment, synsets.txt and hypernyms.txt. Don't worry about storing the data in any data structures yet. Test that you are parsing the input correctly before proceeding.
Create a data type WordNet. Divide the constructor into two (or more) subtasks (private methods).
- Read in the synsets file and build appropriate data structures. The file synsets.txt contains 82,192 synsets, composed from 119,188 nouns. Do not hardwire either of these numbers; your program must work for any valid synset file. Record the number of synsets for use when constructing the underlying digraph from the hypernyms file.
- Read in the hypernyms file and build a Digraph. The file hypernyms.txt corresponds to a rooted DAG with 82,192 vertices and 84,505 edges. Do not hardwire either of these numbers; your program must work for any valid hypernym file.
Implement the remaining WordNet methods.
Implement Outcast. This should be relatively straightforward by calling the appropriate methods from the WordNet data type.
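
For the rooted-DAG check in the steps above, one possible approach is sketched here. It assumes that edges point from a vertex toward its ancestors, so the root is the only vertex with outdegree 0. In an acyclic digraph every vertex has a path to some such vertex, so having exactly one of them means it is an ancestor of every other vertex. The class name RootedDagCheck is hypothetical, and adjacency lists stand in for our Digraph.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; assumes edges point from a vertex toward the root.
class RootedDagCheck {
    static boolean isRootedDag(List<List<Integer>> adj) {
        int V = adj.size();
        int[] indegree = new int[V];
        int sinks = 0;  // vertices with outdegree 0 (root candidates)
        for (int v = 0; v < V; v++) {
            if (adj.get(v).isEmpty()) sinks++;
            for (int w : adj.get(v)) indegree[w]++;
        }
        // Kahn's algorithm: repeatedly remove vertices of indegree 0.
        Queue<Integer> queue = new ArrayDeque<>();
        for (int v = 0; v < V; v++) {
            if (indegree[v] == 0) queue.add(v);
        }
        int removed = 0;
        while (!queue.isEmpty()) {
            int v = queue.remove();
            removed++;
            for (int w : adj.get(v)) {
                if (--indegree[w] == 0) queue.add(w);
            }
        }
        // Every vertex is removed exactly when the digraph has no cycle.
        boolean acyclic = (removed == V);
        return acyclic && sinks == 1;
    }
}
```

Small hand-built digraphs (one with a cycle, one with two roots) make good test cases for whatever approach you choose.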

A video is provided for those wishing additional assistance. Be forewarned that the video was made in early 2014 and is somewhat out of date. For example, the API has changed.

Enrichment

This applet connects words by a chain of WordNet synonyms.
This paper measures the semantic orientation of WordNet adjectives by computing their relative distance to "good" and "bad."