Can I read the synset or hypernym file twice? No. File I/O is very expensive; read each file only once and organize the data in appropriate data structures.
Any advice on how to read and parse the synset and hypernym data files?
readLine() method in our
library to read the data one line at a time.
split() method in Java's
to divide a line into fields. You can find an example using
String id numbers into
int values or
Integer.valueOf() to convert into
Which data structures should I use to store the synsets, synset ids, and hypernyms? This part of the assignment is up to you. You must carefully select data structures to achieve the specified performance requirements.
I want a symbol table that maps each key to multiple values. How can I accomplish that? You can use a standard symbol whose value type is a collection (such as a queue).
sca() return if is there is a tie for the shortest commmon ancestor?
The API does not specify, so you are free to return any shortest common ancestor.
nouns() return each distinct noun once? Or multiple times
if the noun appears in multiple synsets?
The API says to return the set of nouns, so no duplicates.
nouns() return the nouns is alphabetical order?
The API does not specify, so you are free to return them in any order.
Do I need to store the glosses? No, you won't use them on this assignment.
How large is the WordNet DAG? The WordNet DAG has 83,127 synsets, 85,441 hypernyms, and 120,119 distinct nouns. Your program must work for any valid synset and hypernym files, so do not hardwire these constants in your program.
What is the root synset for the WordNet DAG?
38938,entity,that which is perceived or known or inferred to have its own distinct existence (living or nonliving)
Can a synset contain more than one noun?
Yes, a synset is a set of one or more nouns that are interchangeable in some context.
ASL American_sign_language is a synset containing the
Can a noun contain underscore characters?
Yes. For example,
American_sign_language is one noun.
Can a noun contain spaces? No.
Can a noun appear in more than one synset?
Yes, absolutely. It will appear once for each of the noun’s distinct meanings.
For example, the noun
word appears in these 8 synsets:
Do not assume that the number of synsets in which a noun participates is bounded by a constant. The noun36467,discussion give-and-take word,an exchange of views on some topic; "we had a good discussion"; "we had a word or two about it" 57522,news intelligence tidings word,information about recent and important events; "they awaited news of the outcome" 60202,parole word word_of_honor,a promise; "he gave his word" 60400,password watchword word parole countersign,a secret word or phrase known only to a restricted group; "he forgot the password" 82510,word,a brief statement; "he didn't say a word about it" 82511,word,a string of bits stored in computer memory; "large computers use words up to 64 bits long" 82512,word,a unit of language that native speakers can identify; "words are the blocks from which sentences are made"; "he hardly said ten words all morning" 82513,word,a verbal command for action; "when I give the word, charge!"
headappears in 33 synsets.
Can a synset consist of exactly one noun? Yes. Moreover, there can be several different synsets that consist of the same noun, each corresponding to a distinct meaning of that noun.
66,Aberdeen,a city in northeastern Scotland on the North Sea 67,Aberdeen,a town in northeastern Maryland 68,Aberdeen,a town in northeastern South Dakota 69,Aberdeen,a town in western Washington
I'm an ontologist and I noticed that your
hypernyms.txt file contains both is-a and
Yes, you caught us. This ensures that every noun (except
has a hypernym. Here is an article on the
Can I use my own Digraph class? No. You must use Digraph.java. That is the type of the argument to the constructor.
How can I make the data type
You can (and should) save the associated digraph in an instance variable.
Digraph is mutable,
you must first make a defensive copy
(e.g., by calling the copy constructor in
Is a vertex considered an ancestor of itself? Yes.
ancestor() return if is there is a tie
for the shortest common ancestor?
The API does not specify, so you are free to return any shortest common ancestor.
I understand how to compute the
in \(\Theta(E+V)\) time in the worst case but my
method takes \(\Theta(a \times b \times (E + V))\) time,
where a and b are the sizes of the two iterables. How can
I improve it to be \(\Theta(E + V)\) time?
The key is use a multi-source version of breadth-first search, as in the
the constructor in
BreadthFirstDirectedPaths that accepts an iterable
of sources as an argument (instead of a single source).
Should I construct a new
for each call to
No. You need only one
ShortestCommonAncestor object per
distance() should make
one call to either
How can I determine whether a digraph is a rooted DAG? This is a challenge for you to solve. Hint: break it up into two parts: (1) check if a digraph is a DAG and (2) check if a DAG is a rooted DAG.
Should I re-implement depth-first search to find directed cycles? No. Instead, use either DirectedCycle.java or Topological.java.
Should I re-implement breadth-first search to compute shortest common ancestors? Unless you are attempting the extra credit, you should use BreadthFirstDirectedPaths.java. To earn the extra credit, however, you'll need to re-implement breadth-first search.
Do I need to throw exceptions explicitly with a throw statement?
No, it's fine if they are thrown implicitly.
For example, you can rely on any method in
to throw an
if passed a vertex argument outside of the prescribed range.
A good API documents the requisite behavior for all possible arguments
but, hopefully, you should not need much extra code to deal with these corner cases.
For the extra credit, do
need to take time proportional to
to the number of vertices and edges reachable from the argument vertices
in the worst case? Or, may I use hashing?
You can make standard technical assumptions (such as the uniform hashing assumption).
If you do so, state any assumptions that you make in your
outcast() return if is there is a tie for the outcast?
The API does not specify, so you are free to return any outcast.
My algorithm computes the distance between every pair of nouns. Is that okay? Yes, that's fine.
Some examples. Here are some interesting examples that you can use to test your code.
President(in capitalized form) appears in two different synsets:
14479,President_of_the_United_States President Chief_Executive,the office of the United States .... 14480,President_of_the_United_States United_States_President President Chief_Executive,the person who holds the office .....
municipalityhas two paths to
municipality -> administrative_district -> district -> region municipality -> populated_area -> geographic_area -> region
edible_fruithave several different paths to their common ancestor
individual -> organism being -> living_thing animate_thing -> whole unit -> object physical_object -> physical_entity person individual someone somebody mortal soul -> causal_agent cause causal_agency -> physical_entity edible_fruit -> garden_truck -> food solid_food -> solid -> matter -> physical_entity edible_fruit -> fruit -> reproductive_structure -> plant_organ -> plant_part -> natural_object -> unit -> object -> physical_entity
(distance = 23) white_marlin, mileage (distance = 33) Black_Plague, black_marlin (distance = 27) American_water_spaniel, histology (distance = 29) Brown_Swiss, barrel_roll
Ambrose Saint_Ambrose St._Ambrose
ShortestCommonAncestor. First, think carefully about designing a correct and efficient algorithm for computing the shortest common ancestor. Consult a staff member if you're unsure. In addition to the
digraph*.txtfiles, design small rooted DAGs to test and debug your code. Modularize by sharing common code.
ShortestCommonAncestorto detect whether a digraph is a rooted DAG. As defined in the assignment, a digraph is a rooted DAG if it is acyclic and has one vertex—the root—that is an ancestor of every other vertex.
hypernyms.txt. Don't worry about storing the data in any data structures yet. Test that you are parsing the input correctly before proceeding.
WordNet. Divide the constructor into two (or more) subtasks (private methods).
synsets.txtcontains 83,127 synsets, composed from 120,119 nouns. Do not hardwire either of these numbers; your program must work for any valid synset file. Record the number of synsets for use when constructing the underlying digraph from the hypernyms file.
Digraph. The file
hypernyms.txtcorresponds to a rooted DAG with 83,127 vertices and 85,441 edges. Do not hardwire either of these numbers; your program must work for any valid hypernym file.
Implement the remaining
Outcast. This should be relatively straightforward
by calling the appropriate methods from the
WordNet data type.