CS 302: HW#4, Bayes and Viterbi

COS 302: Introduction to Artificial Intelligence

Homework #4

Bayes and Viterbi

Fall 2003

Due: Monday, November 17

Part I: Written Exercises

Turn these in at the end of class or in room 001A Computer Science on or before the due date. Problems 1 and 3 will be worth around 10 points each; problems 2, 4 and 5 will be worth around 15 points each. Be sure to show your work and justify all of your answers.

1. There are three condemned prisoners A, B and C. The governor has announced that one of the three, chosen at random, has been pardoned, but doesn't say which. Prisoner A, realizing that he only has a 1/3 chance of having been pardoned and anxious to learn his fate, reasons with the warden as follows: "Please tell me the name of one of the other prisoners B or C who will be executed. I already know that at least one of them will be executed so you will not be divulging any information." The warden thinks for a minute and then asks how he should choose between B or C in case both are to be executed. "In that case," A tells him, "simply flip a coin (when I'm not around) to choose randomly between the two." Reluctantly, the warden agrees and the next day tells A that B will be executed. On hearing this news, A smiles and thinks to himself, "What a fool this warden is to fall for my trick. Now my chances of having been pardoned have increased from 1/3 to 1/2" (since either A or C must have been pardoned).

Was A correct in his final assertion? Use Bayes rule to compute the probability that A was pardoned given that the guard tells him that B will be executed. (Note that this is not the same as the probability that A was pardoned given that B will be executed (i.e., was not pardoned). This is because the event that "the guard tells A that B was not pardoned" is different from the event that "B was not pardoned". This is a tricky problem so be careful to distinguish these two events.)
Suppose, in the event that the warden has to choose between B or C, that he chooses B with probability p (rather than 1/2 as above). Now what is the probability that A was pardoned (in terms of p) given, as before, that the guard tells him that B will be executed? (If you wish, you may answer part b first, and then answer part a by setting p=1/2.)

2. Exercise 14.2 in R&N.

3. Consider the burglar alarm example in R&N Figure 14.2 (and discussed extensively in class). In R&N (page 505), it is shown that the probability of a burglary, given that both John and Mary call, is roughly 28.4%. Suppose now that you hear on the radio that there was an earthquake in your area. Now what is the probability of a burglary (given that JohnCalls, MaryCalls and Earthquake are all equal to true)? Has the probability of a burglary increased or decreased as a result of this new information? Intuitively, why does this make sense? (This phenomenon is called "explaining away".)

4. (This is more or less a rephrasing of R&N Exercise 14.10a.) In a Bayes net, prove that the conditional probability of a variable given the settings of all other variables is equal to its conditional probability given only the settings of those variables in its Markov blanket. (Hint: use R&N Eq. (14.1) combined with marginalization and the definition of conditional probability to compute both of these quantities and to show that they are equal.)

5. Let G be an undirected graph. Consider a random walk on such a graph in which we begin at a designated start vertex, and proceed, at each time step, to traverse to a randomly selected neighbor of the current vertex. In other words, at each time step, we traverse one of the edges (selected at random) that is incident to the vertex we currently occupy. Let d(v) be the degree of vertex v, i.e., the number of edges that are incident to vertex v. Let m be the total number of edges in the graph.

What is the transition probability function q(v --> v') for this random walk?
Let p(v) = d(v)/(2m). Show that p is a stationary distribution for this random walk, i.e., that it satisfies the stationarity equation (R&N Eq. (14.9)).
Consider a chess king taking a random walk on an ordinary chess board. At each time step, he can move one step in any direction, horizontally, vertically or diagonally. What is the stationary distribution p in this case?

Part II: Programming

This assignment is about hidden Markov models (HMMs) and their many potential applications. The main components of the assignment are the following:

implement a method to build an HMM from data,
implement the Viterbi algorithm for finding the most likely sequence of states through the HMM, given "evidence",
run your code on several datasets and explore its performance.

There is also an optional part to this assignment involving second-order Markov models, as described below.

Building an HMM from data

The first part of the assignment is to build an HMM from data. Recall that an HMM involves hidden state that changes over time, as well as observable evidence, henceforth called the output of the HMM. An HMM is defined by three sets of probabilities:

for each state s, the probability of observing each output o at that state (in the notation of R&N, this is P(E[t]=o | X[t]=s))
from each state s, the probability of traversing to every other state s' in one time step (P(X[t+1]=s' | X[t]=s))
a distribution over the start state (P(X[0])).

Regarding item 3, in this assignment, we will assume that there is a single dummy start state, distinct from all other states, and to which the HMM can never return. Even so, you will need to estimate the probability of making a transition from this dummy start state to each of the other states (this is implicit in part 2, but may need to be done explicitly by your program).

For items 1 and 2, your job will be to compute estimates of these probabilities from data. We are providing you with training data consisting of one or more sequences of state-output pairs, i.e., sequences of the form x[1], e[1], x[2], e[2], ..., x[n], e[n]. During this training phase, we assume that the state variables are visible. Given these sequences, you need to estimate the probabilities that define the HMM. For instance, to estimate the probability of output o being observed in state s, you might simply count up the number of times that output o appears with state s in the given data, and divide by a normalization constant (so that the probabilities of all outputs from that state add up to one). In this case, that normalization constant would simply be the number of times that state s appears at all in the data.

Although this approach corresponds to the meaning of a conditional probability, when making estimates of this sort, it is often preferable to smooth the estimates. To see what I mean, consider flipping a coin for which the probability of heads is p, where p is unknown, and our goal is to estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get heads both times. Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit rash to conclude that the coin will always come up heads, and smoothing is a way of avoiding such rash conclusions. A simple smoothing method, called Laplace smoothing or Laplace's law of succession, is to estimate p by (one plus the number of heads) / (two plus the total number of flips). Said differently, if we are keeping count of the number of heads and the number of tails, this rule is equivalent to starting each of our counts at one, rather than zero. This latter view generalizes to the case in which there is more than two possible outcomes (for instance, estimating the probability of a die coming up on each of its six faces).

Another advantage of Laplace smoothing is that it avoids estimating any probabilities to be zero, even for events never observed in the data. For HMMs, this is important since zero probabilities can be problematic for some algorithms.

So, for this assignment, you should use Laplace-smoothed estimates of probabilities. For instance, returning to the problem of estimating the probability of output o being observed in state s, you would use one plus the number of times output o appears in state s in the given data, divided by a normalization constant. (In this case, the normalization constant would be the number of times state s appears in the data, plus the total number of possible outputs. However, there is really no need to work this constant out explicitly since your code can do the normalization numerically.)

You will need to also work out Laplace-smoothed estimates for item 2, i.e., for the probability of making a transition from one state to another, as well as the probability of making a transition from the dummy start state to any of the other states.

Finding the most likely sequence

The second part of the assignment is to write code that computes the most probable sequence of states (according to the HMM that you built from data) for a given sequence of outputs. This is essentially the problem of implementing the Viterbi algorithm as described in R&N.

The second part of each of the provided datasets consists of test sequences of state-output pairs. Your Viterbi code will be provided with just the output part of each of these sequences, and from this, must compute the most likely sequence of states to produce such an output sequence. The state part of these sequences is provided so that you can compare the estimated state sequences generated by your code to the actual state sequences that generated this data. Note that these two sequences will not necessarily be identical, even if you have correctly implemented the Viterbi algorithm.

A numerical tip: the Viterbi algorithm involves multiplying many probabilities together. Since each of these numbers is smaller than one (possibly much smaller), you can end up working with numbers that are tiny enough to be indistinguishable from zero by a real computer. To avoid this problem, it is recommended that you work with log probabilities. So, rather than multiplying two probabilities p and q, simply add their logarithms using the rule

log(pq) = log(p) + log(q).

You will probably find that there is never a need to store or manipulate actual probabilities; instead, everything can be done using log probabilities.

Exploring performance on actual datasets

We are providing data for three problems that HMMs can be used to solve:

a robot toy problem in which the goal is to infer the sequence of locations visited by the robot based on sensor readings,
the problem of correcting typographical errors, and
the problem of inferring and tracking changing topics in a stream of text.

For each of these problems, you should run your program and examine the results, exploring how, why and when it works. Then you should write up briefly (say, in 1-3 paragraphs for each problem) what you found. Your write-up should include any observations you may have made about how well HMMs work on these problems and why. Your observations can be quantitative (for instance, the number of errors was x) or anecdotal (for instance, the method worked really well at correcting these typos, but not these other ones). You might take anecdotal observations as a starting point for making some particular quantitative measurement (for instance, after noticing that the method seems to fail in particular situations, you might want to write a program that will systematically check if this is indeed the case). Be critical and objective, pointing out both successes and failures. Try to think of plausible explanations for your observations, and in the case of failures, try to think of modifications of our approach that might lead to greater success. Although this write-up is quite open ended, you should be sure to discuss what probabilistic assumptions we are making about the nature of the data in using a hidden Markov model, how well those assumptions match the actual process generating the data, and to what extent performance was or was not hurt by making such realistic or unrealistic assumptions.

If you do the optional part of this assignment, your write-up should incorporate observations on the use of second-order Markov models. (Even if you don't do the optional part, you might want to speculate in your write-up why second-order Markov models might or might not improve performance.)

Toy robot

In problem 1, a robot is wandering through the following small world:

The robot can only occupy the colored squares. At each time step, the robot attempts to move up, down, left or right, where the choice of direction is made at random. If the robot attempts to move onto a black square, or to leave the confines of its world, its action has no effect and it does not move at all. The robot can only sense the color of the square it occupies. However, its sensors are only 90% accurate, meaning that 10% of the time, it perceives a random color rather than the true color of the currently occupied square. The robot begins each walk in a randomly chosen colored square.

In this problem, state refers to the location of the robot in the world in x:y coordinates, and output refers to a perceived color (r, g, b or y). Thus, a typical random walk looks like this:

3:3 r 3:3 r 3:4 y 2:4 b 3:4 y 3:3 r 2:3 b 1:3 g 2:3 b 2:4 r 3:4 y 4:4 y

Here, the robot begins in square 3:3 perceiving red, attempts to make an illegal move (to the right), so stays in 3:3, still perceiving red. On the next step, the robot moves up to 3:4 perceiving yellow, then left to 2:4 perceiving blue (erroneously), and so on.

By running your program on this problem, you will build an HMM model of this world. Then, given only sensor information (i.e., a sequence of colors), your program will re-construct an estimate of the actual path taken by the robot through its world.

The data for this problem is in robot_no_momentum.data, a file containing 200 training sequences (random walks) and 200 test sequences, each sequence consisting of 200 steps.

We also are providing data on a variant of this problem in which the robot's actions have "momentum" meaning that, at each time step, with 85% probability, the robot continues to move in the direction of the last move. So, if the robot moved (successfully) to the left on the last move, then with 85% probability, it will again attempt to move left. If the robot's last action was unsuccessful, then the robot reverts to choosing an action at random. Data for this problem is in robot_with_momentum.data.

Correcting typos

Problem 2 deals with the problem of correcting typos in text. Here, you will be given text containing many typographical errors and the goal is to correct as many typos as possible.

In this problem, state refers to the correct letter that should have been typed, and output refers to the actual letter that was typed. Given a sequence of outputs (i.e., actually typed letters), the problem is to reconstruct the hidden state sequence (i.e., the intended sequence of letters). Thus, data for this problem looks like this:

i i n n t t r e o o d d u u c x t t i i o o n b _ _ t t h h e e _ _

where the left column is the correct text and the right column contains text with errors.

Data for this problem was generated as follows: we started with a text document, in this case, the Unabomber's Manifesto, which was chosen not for political reasons, but for its convenience being available on-line and of about the right length. For simplicity, all numbers and punctuation were converted to white space and all letters converted to lower case. The remaining text is a sequence only over the lower case letters and the space character, represented in the data files by an underscore character. Next, typos were artificially added to the data as follows: with 90% probability, the correct letter is transcribed, but with 10% probability, a randomly chosen neighbor (on an ordinary physical keyboard) of the letter is transcribed instead. Space characters are always transcribed correctly. In a harder variant of the problem, the rate of errors is increased to 20%. The first (roughly) 20,000 characters of the document have been set aside for testing. The remaining 161,000 characters are used for training.

As an example, the original document begins:

introduction the industrial revolution and its consequences have been a disaster for the human race they have greatly increased the life expectancy of those of us who live in advanced countries but they have destabilized society have made life unfulfilling have subjected human beings to indignities have led to widespread psychological suffering in the third world to physical suffering as well and have inflicted severe damage on the natural world the continued development of technology will worsen the situation it will certainly subject human beings to greater indignities and inflict greater damage on the natural world it will probably lead to greater social disruption and psychological suffering and it may lead to increased physical suffering even in advanced countries the industrial technological system may survive or it may break down if it survives it may eventually achieve a low level of physical and psychological suffering but only after passing through a long and very painful period of adjustment and only at the cost of permanently reducing human beings and many other living organisms to engineered products and mere cogs in the social machine

With 20% noise, it looks like this:

inteoduxtiob the insuarrial rwciluruon abd ita xobsequences hqcw been a diaaster for rhw gumab racw they hacw greatlt uncreased thw life expectabct of thise of ua who livw un advanxed countriws but rhwy hqve searabulized aociwtt gqve made lifw unfyldillung gqvw subjexted human geings ti indignitues have ked ro widespread psyxhologuxal sufferibg in rhe thurd worls to ohyaical suffwribg as well abd hace ibflucted severe damagw ib the nqtueal wirld tge cintibues dwvekopmwnt of rechnology qukl qorswn tgw sutuqtion it will cerrainly subjwct gunab bwungs to greqtwe indigbities and ibflict grearer samage in the batural worls it wikl probablt lead ro geeatwr sociql disruption and psyxgological syffwribg and ir may leqs ro uncreaaed physical suffering even un advqnced xounteiea rhe insustrual technological systen mat survice or it mqy brwak doqb uf it survives it may ecentualky achueve a low level of ohysuxal and oatcholigixal syffering but inky aftwr oasaing throygh a kong ans ceey oaibfyl oeriod of adjuatment and inly qt the cosr od pwrmanently reduxinf human beunga and nqby other living organisna ti ebgineered oroducta qnd mere cogs in the social mqxhibw

The error rate (fraction of characters that are mistyped) is about 16.5% (less than 20% because space characters were not corrupted).

The text reconstructed using an HMM with the Viterbi algorithm looks like this:

inteoduction the insuatrial recilurion and its condequences hace been s dizaster for the buman race they hace greatly increased the life expectanct of thise of is who live in advanced countries but they have searabulized society have made life inguldilling have subjexted human beings to indignities have led to widespread psychological suffering in the thurd worls to phusical suffering as well and hace inglicted severe damage in the matieal wirld the cintibues development of rechnology qull worsen the sutiation it will cerrainly subject buman beings to greatee indignities and inglict grearer samage in the batural worls it will pronably lead to berater social distuption and psychological suffering and ir may leas to increased phusical suffering even in advanced counteies the insustrial technological systen mat survice or it may breal down if it survives it may ecentially achieve s low level of phusucal and patcholigical suffering but inly after passing through s long ans cery osingul period of adjuatment and inly at the cost od permanently reducing human beinga and maby other living organisma to engineered oroducts and mere cogs in the social machine

The error rate has dropped to about 7.8%.

If you do the extra credit part of this assignment which involves building a second-order Markov process, you will get reconstructed text that looks like this:

introduction the industrial revolution and its consequences have been a disaster for the human race they have greatly increased the life expectanct of thise of is who live in advanced countries but they have searabilized society have made life infulfilling have subjexted human beings to indignities have led to widespread psychological suffering in the third world to physical suffering as well and have inflicted devere damage in the natural world the continues development of technology will worden the situation it will certainly subject human beings to greater indignities and inflict greater samage in the natural world it will probably lead to greater social disruption and psychological suffering and it may leas to uncreased physical suffering even in advanced countries the industrial technological systen may survive or it may break down if it survives it may eventually achieve a low level of physical and psychological suffering but inly after passing through a long and cery painful ortiod of adjustment and inly at the cost of permanently reducing human beings and many other living organisms to engineered products and mere cohs in the social machine

The error rate now has dropped even further to about 3.3%.

Data for this part of the assignment is in typos10.data and typos20.data, representing data generated with a 10% or 20% error rate, respectively.

Tracking a changing topic

Problem 3 deals with the problem of tracking a changing topic, for instance, in a conversation or while watching the news. Your program will be provided with a stream of words, first on one topic, then on another topic, and so on. The goal is to segment the stream of text into blocks, and to identify the topic of each of these blocks. There are six possible topics, namely, baseball, cars, guns, medicine, religion and windows (as in Microsoft). A state in this problem is an underlying topic. An output is an actual word appearing in the text. So for instance, data might look like this:

baseball when baseball are baseball the baseball yankees baseball planning baseball on baseball activating baseball melido baseball perez medicine one medicine of medicine my medicine friends medicine has medicine had medicine to medicine go medicine to medicine the medicine doctor medicine because medicine he medicine had medicine chest medicine pains guns what guns to guns do guns if guns you guns shoot guns somebody

To simplify the data file, lines on which the output changes but not the state can be combined. Thus, in terms of coding the data, the above data is exactly equivalent to the following:

baseball when are the yankees planning on activating melido perez medicine one of my friends has had to go to the doctor medicine because he had chest pains guns what to do if you shoot somebody

Note that there are many possible outputs for this problem, essentially one for every word. Your code should be efficient enough to handle such a large number of possible outputs.

Data for this problem was created as follows: the bodies of 5302 articles were collected from six newsgroups (namely, rec.sports.baseball, rec.auto, talk.politics.guns, sci.med, talk.religion.misc and comp.os.ms-windows.misc) corresponding to the six topics. All punctuation was removed (converted to white space) and all letters converted to lower case. The articles were then randomly permuted and concatenated together forming a sequence of states (topics) and outputs (words). 1500 articles were saved for use in testing, the rest for training.

Using an HMM with the Viterbi algorithm on this data will produce a sequence of topics attached to each of the words. For instance, in the case above, we might get something like the following (I made this up -- this is not based on an actual run):

baseball when baseball are baseball the baseball yankees baseball planning baseball on baseball activating baseball melido baseball perez baseball one baseball of baseball my baseball friends medicine has medicine had medicine to medicine go medicine to medicine the medicine doctor medicine because medicine he medicine had guns chest guns pains guns what guns to guns do guns if guns you guns shoot guns somebody

This corresponds to breaking up the text (rather imperfectly) into three blocks: "when are the yankees planning on activating melido perez one of my friends" with topic baseball; "has had to go to the doctor because he had" with topic medicine; and "chest pains what to do if you shoot somebody" with topic guns.

Data for this part of the problem is in the file topics.data.

Second-order Markov models (optional)

For extra credit, you can redo the entire assignment using a second-order Markov model. Recall that in such a model, the next state depends not only on the current state, but also on the last state. Thus, X[t+1] depends both on X[t] and on X[t-1]. (However, the evidence E[t] still depends only on the current state X[t].) These models are described in R&N. You will need to modify your representation of an HMM, and you will also need to modify the Viterbi algorithm. Your implementation should be fast enough for us to test on the kind of data provided in a reasonable amount of time. For this optional part of the assignment, you should also run your second-order model on some of the datasets and comment on its performance in your write-up.

The code we are providing

We are providing a class called DataSet that loads data from a file and stores it in a number of data structures available as public fields of the class. The class has a constructor taking a file name as argument that reads in the data from the file. Each state is represented by an integer in the range 0 (inclusive) to numStates (exclusive), where numStates is the total number of states. (These do not include the dummy start state.) Outputs are represented in a similar fashion. The stateName and outputName arrays convert these integers back to strings in the obvious fashion. The array of arrays trainState represents the set of sequences of states observed during training. Thus, trainState[i][j] is the value of the j-th state in the i-th training sequence. Similarly for trainOutput, testState and testOutput.

Data files consist of one or more sequences of training state-output pairs, followed by one or more sequences of testing state-output pairs. Each sequence is separated by a line consisting of a single period. Training and testing data are separated by a line consisting of two periods. For instance, here is a simple, sample file:

a 0 b 1 0 0 a 1 . a 1 1 .. b 0 1 a 1 . a 0 a 0

In this tiny example, there are two states, a and b, and two outputs 0 and 1. There are two training sequences: (a,0),(b,1),(b,0),(b,0),(a,1) and (a,1),(a,1). There are two test sequences: (b,0),(b,1),(a,1) and (a,0),(a,0).

We also are providing a class called Test consisting only of a main. This file is provided as a starting point for testing your code. You will probably want to modify this file, or write your own. Running Test will call the DataSet constructor with the file name provided on the command line; will then use the data in this file to create an HMM (using your code); will print out the HMM (unless it is too big); and will finally call your code to compute the most likely state sequence on each of the test output sequences.

If run on the sample file above, the output would look like this:

Start probabilities: a : .750 b : .250 Transition probabilities: a b a : .500 .500 b : .400 .600 Output probabilities: 0 1 a : .333 .667 b : .600 .400 sequence 0: b a 0 b a 1 a a 1 errors: 2 / 3 = 0.6666666666666666 sequence 1: a a 0 a b 0 errors: 1 / 2 = 0.5

The first table shows the probability of a transition from the dummy start state to each of the states (in this case, a and b). The second table shows the probability of transitioning from each state to every other state. The third table shows the probability of each output being observed in each state. Finally, each test sequence is printed. The first column shows the correct state; the second column shows the state sequence estimated by your code; the third column shows the output sequence. Finally the number of errors (where columns one and two disagree) is printed.

To debug your code, you will surely want to run it on small data files of this kind before running on the large datasets provided. Also, when running on large datasets, you may need to increase the maximum heap size allocated to your program. This option is platform dependent (use the -X option to find out what it is on yours), but often is something like -Xmx512M, which will give your program access to 512MB of memory.

The code that you need to write

Your job is to fill in the constructor and all of the methods in the template Hmm.java file that is being provided. Part 1 of this assignment belongs in the buildFromData method. Part 2 goes in the mostLikelySequence method. There are other simple methods in this template that you also need to write for probing the HMM. In addition, you need to write a method buildFromTable which builds an HMM with a specified set of probabilities. This should be simple to write. We are asking you to write these latter methods so that we can effectively and automatically test your code.

If you are doing the optional part of this assignment, you also should write a class Hmm2.java. This class should include a constructor, and the methods buildFromData and mostLikelySequence with signatures identical to those in Hmm.java. However, you do not need to provide any of the other methods.

The final code that you turn in should not write anything to standard output or standard error.

All code and data files can be obtained from this directory, or all at once from this tar file. Documentation is available here.

What to turn in

Using whiteboard, you should turn in the following:

A class Hmm.java with the constructor and all of the methods filled in.
If you are doing the optional part of this assignment, a class Hmm2.java as described above.
Any other java files that you wrote and used, or that are needed by your code.
A readme.txt file explaining briefly how your code is organized, what data structures you are using, or anything else that will help the TA's understand how your code works.

In addition, you should turn in your write-up exploring the provided datasets as described above. This written work should be handed in with the written exercises.

What you will be graded on

We will automatically test your Hmm code on data similar (but not identical) to the data provided with this assignment. Your grade will depend largely on getting the right answers. In addition, your code should be efficient enough to run reasonably fast (easily under a minute on each of the provided data sets on a 2.4GHz machine), and should not terminate prematurely with an exception (unless given bad data, of course); code that does otherwise risks getting no credit for the automatic testing portion of the grade. As usual, you should follow good programming practices, including documenting your code.

Your write-up should be clear, concise, thoughtful and perceptive.

This programming assignment will be worth about 80 points (about 35 based on automatic testing of your code, 10 for documentation and good programming practice, and 35 for the write-up). The optional component will be worth roughly 20-25 points.