Emily's Hints for KMeans:

I found the assignment page fairly confusing, so I wrote up this to help you guys out.

General hints:
A lot of this assignment is already filled in. Read what is there carefully: the code and the comments.

It would behoove you to create a toString() method for Gene and Cluster. This helps a ton with debugging.

Also, you should write a test client and THOROUGHLY test each method in each class before moving on to the next. Think about edge cases. How would someone try to break your code?

1.) Gene class
       - .equals() and .hashcode() are one-liners
       - euclideanDistance() is simple; just follow the formula. How do we compute sums with code?
       - spearmanDistance(); since we are in the Gene class, you can access other.exprRanks directly even though it is private. Alternatively, add a getter method to return exprRanks from an object of type Gene
       - you also need some type of transform for spearmanDistance(). The Spearman metric is always been (-1, 1), with 1 being perfectly correlated and -1 being anti-correlated. But you need to transform (-1, 1) into (Inf, 0) so that a larger value indicates more distance between two Genes, while 0 is perfectly correlated. Hint: there is more than one way to do this, and you will get very similar results either way.

2.) Cluster class
       - What is a HashSet? What does <Gene> means? Google Java HashSet and Java generic for some insight.
       - there are some other instance variables that you can add to this class to make life easier. What might they be? Where do you need to increment/calculate them?
       - The constructor is easy. Just initialize your HashSet and any extra instance vars you might add.
       - There is some code on the assignment page to make printGeneNames() work. If confused, do some reading about the built-in Iterator class in Java.
       - centroid(): you need to pull out the expression values for each gene in this cluster, then average the expression for each experiment. This gives you a new array of values that are the mean expression for the whole cluster. Use this array to initialize a "dummy" Gene called "centroid" and then return it.
       - addGene() is already filled out, but if you created new instance variables, this might be a good place to update them!
       - createJPG(): don't touch this! It works just fine. All you need to understand is what arguments it takes and what it does.

3.) KMeans class
       - What is contained in the given instance variables? Hint: you don't need to create any more unless you really want to.
       - Constructor: don't touch! It's good to go.
       - main: How do you initialize a KMeans object? How do you run a clustering? How do you get information on the cluster out? Finally, how do you make a .jpg of each cluster? Hint: look at the commented out code at the bottom.

KMeans algorithm

Step 1. initialize centroids randomly.
       - pick K random numbers between 0 and genes.length-1
       - store these K genes in an array; these are your starting centroids
       - you only do this step ONCE!

ENTER LOOP

Step 2. initialize/reset clusters
       - just loop over each cluster, and set each one equal to a fresh, empty cluster
       - in the first iteration you are initializing the cluster, after that you are resetting them

Step 3. assign genes to the closest centroid
       - you need to take each gene, and look at all the clusters' centroids to determine which one is closest. How do we find a minimum in code? The distance metric will be either spearman or euclidean; this is passed as an argument to the performClustering() method.
       - have some psuedo code:

for each gene i;
    mindist = Inf;
    for each cluster j;
       compute distance(gene, centroid of cluster j);
       if (distance < mindist)
            set j as new cluster for gene i;
            set mindist to distance;
    add gene i to cluster j;

Step 4. recompute centroids
       - you have now filled each cluster with some number of genes. so you have to recompute the centroid. Luckily you have a method that does this!
       - why do you have to recompute the centroid?

CHECK LOOP CONDITION: newly recomputed centroids != old centroids
RETURN to step 2 if centroids have changed. (Why?)
If centroids have not changed, exit loop.