COS 226 Programming Assignment

WordNet


WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym to plant root and plant root is a hypernym to carrot.

The WordNet DAG. Your first task is to build the WordNet graph: each vertex v is an integer that represents a synset, and each directed edge v->w represents that w is a hypernym of v. The graph is directed and acyclic (a DAG), though not necessarily a tree since each synset can have several hypernyms.

We now describe the two data files that you will use to create the WordNet digraph. The files are in CSV format: each line contains a sequence of fields, separated by commas.

Implement a data type WordNet with the following API.

// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms)

// return all of the glosses associated with a given noun's synsets
public Iterable<String> glosses(String noun)

// is the word a WordNet noun?
public boolean isNoun(String word)

 return the distance between nounA and nounB (defined below); -1 if infinite
public int distance(String nounA, String nounB)

// return the synset (second field of synsets.txt) that is the common ancestor
// of nounA and nounB in a shortest ancestral path; null if no such path (defined below)
public String sap(String nounA, String nounB)

Shortest ancestral path. An ancestral path between two vertices v and w in a digraph is a directed path from v to a common ancestor x, together with a directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length. For example, in the digraph at left, the shortest ancestral path between 3 and 11 has length 4 (with common ancestor 1). In the digraph at right, one ancestral path between 1 and 5 has length 4 (with common ancestor 5), but the shortest ancestral path has length 2 (with common ancestor 0).

Implement a data type SAP with the following API.

// constructor
public SAP(Digraph G)

// return length of shortest ancestral path of v and w; -1 if no such path
public int length(int v, int w)

// return a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w)
Also, include a main() that takes the name of a digraph input file (standard format used by Digraph) as a command-line argument, constructs the digraph, reads in vertex pairs from standard input, and prints out the length of a shortest ancestral path between them and a common ancestor that participates in a shortest ancestral path.
% more digraph1.txt             % java SAP digraph1.txt
13                               3 11
 7  3                            sap = 4, ancestor = 1
 8  3                            
 3  1                            9 12
 4  1                            sap = 3, ancestor = 5
 5  1
 9  5                            7 2
10  5                            sap = 4, ancestor = 0
11 10
12 10                            1 6
 1  0                            sap = -1, ancestor = -1
 2  0

Measuring the semantic relatedness of two nouns. Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two US presidents) are more related than are George Bush and chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.

We estimate the semantic relatedness of two nouns distance(A, B) as follows:

Using this notion of distance, add the methods distance() and sap() to the WordNet data type.

Outcast detection. Given a list of nouns A1, A2, ..., An, which noun is the least related to the others? To identify the outcast, compute the sum of the squares of the distance between each noun and every other one:

(di)2 = (dist(Ai, A1))2 + (dist(Ai, A2))2 + ... + (dist(Ai, An))2
and return the noun At for which dt is maximum.

Implement a data type Outcast with the following API

// constructor
public Outcast(WordNet wordnet)

// given an array of nouns, return the outcast
public String outcast(String[] nouns)
Also include a main() that takes the name of a synset file, the name of a hypernym file, followed by the names of outcast files, and prints out the outcast in each file.
% more outcast1.txt
5
horse zebra cat bear table

% more outcast2.txt
11
apple pear peach banana lime lemon blueberry strawberry mango watermelon potato

% more outcast3.txt
8
water soda bed orange_juice milk apple_juice tea coffee

% java Outcast synsets.txt hypernyms.txt outcast1.txt outcast2.txt outcast3.txt
outcast1.txt: table
outcast2.txt: potato
outcast3.txt: bed

Create a few sample input outcast files (in the format above) and submit with your assignment. Include at least one where the method above performs well and one where it does not seem to perform well.

Deliverables. Submit WordNet.java, SAP.java, and Outcast.java that implement the APIs described above, along with a a readme.txt file. Also submit any other supporting files (excluding those in stdlib.jar and adt.jar).



This assignment was created by Alina Ene and Kevin Wayne.
Copyright © 2006