Emily's Hints for KMeans:
I found the assignment page fairly confusing, so I wrote up this to help you guys out.
General hints:
A lot of this assignment is already filled in. Read what is there carefully: the code and the comments.
It would behoove you to create a toString() method for Gene and Cluster. This helps a ton with debugging.
Also, you should write a test client and THOROUGHLY test each method in
each class before moving on to the next. Think about edge cases. How
would someone try to break your code?
1.) Gene class
- .equals() and .hashcode() are one-liners
- euclideanDistance() is simple; just follow the formula. How do we compute sums with code?
- spearmanDistance(); since we are in
the Gene class, you can access other.exprRanks directly even though it
is private. Alternatively, add a getter method to return exprRanks from
an object of type Gene
- you also need some type of transform
for spearmanDistance(). The Spearman metric is always been (-1, 1),
with 1 being perfectly correlated and -1 being anti-correlated. But you
need to transform (-1, 1) into (Inf, 0) so that a larger value
indicates more distance between two Genes, while 0 is perfectly
correlated. Hint: there is more than one way to do this, and you will
get very similar results either way.
2.) Cluster class
- What is a HashSet? What does
<Gene> means? Google Java HashSet and Java generic for some
insight.
- there are some other instance
variables that you can add to this class to make life easier. What
might they be? Where do you need to increment/calculate them?
- The constructor is easy. Just
initialize your HashSet and any extra instance vars you might add.
- There is some code on the assignment
page to make printGeneNames() work. If confused, do some reading about
the built-in Iterator class in Java.
- centroid(): you need to pull out the
expression values for each gene in this cluster, then average the
expression for each experiment. This gives you a new array of values
that are the mean expression for the whole cluster. Use this array to
initialize a "dummy" Gene called "centroid" and then return it.
- addGene() is already filled out, but
if you created new instance variables, this might be a good place to
update them!
- createJPG(): don't touch this! It
works just fine. All you need to understand is what arguments it takes
and what it does.
3.) KMeans class
- What is contained in the given
instance variables? Hint: you don't need to create any more unless you
really want to.
- Constructor: don't touch! It's good to go.
- main: How do you initialize a KMeans
object? How do you run a clustering? How do you get information on the
cluster out? Finally, how do you make a .jpg of each cluster? Hint:
look at the commented out code at the bottom.
KMeans algorithm
Step 1. initialize centroids randomly.
- pick K random numbers between 0 and genes.length-1
- store these K genes in an array; these are your starting centroids
- you only do this step ONCE!
ENTER LOOP
Step 2. initialize/reset clusters
- just loop over each cluster, and set each one equal to a fresh, empty cluster
- in the first iteration you are initializing the cluster, after that you are resetting them
Step 3. assign genes to the closest centroid
- you need to take each gene, and look
at all the clusters' centroids to determine which one is closest. How
do we find a minimum in code? The distance metric will be either
spearman or euclidean; this is passed as an argument to the
performClustering() method.
- have some psuedo code:
for each gene i;
mindist = Inf;
for each cluster j;
compute distance(gene, centroid of cluster j);
if (distance < mindist)
set j as new cluster for gene i;
set mindist to distance;
add gene i to cluster j;
Step 4. recompute centroids
- you have now filled each cluster with
some number of genes. so you have to recompute the centroid. Luckily
you have a method that does this!
- why do you have to recompute the centroid?
CHECK LOOP CONDITION: newly recomputed centroids != old centroids
RETURN to step 2 if centroids have changed. (Why?)
If centroids have not changed, exit loop.