COS 233

K-Means Clustering
Programming Assignment

Pair programming. On this assignment, you are encouraged (not required) to work with a partner provided you practice pair programming. Pair programming "is a practice in which two programmers work side-by-side at one computer, continuously collaborating on the same design, algorithm, code, or test." One partner is driving (designing and typing the code) while the other is navigating (reviewing the work, identifying bugs, and asking questions). The two partners switch roles every 30-40 minutes, and on demand, brainstorm.

Before pair programming, you must read the article All I really need to know about pair programming I learned in kindergarten. You should choose a partner of similar ability.

Only one partner submits the code and readme.txt; the other partner submits only an abbrievated readme.txt that contains both partners' names, logins and information about extra credit. The names and logins of both partners MUST appear at the top of every submitted file. Both partners receive the same grade.

Writing code with a partner without following the pair programming instructions listed above is a serious violation of the course collaboration policy.

For this assignment, you will write a program to cluster genes from a real microarray dataset using k-means clustering. As you know from lecture, k-means clustering aims to divide a set of genes into k clusters, with k determined in advance by the user (that's you). The goal of the k-means algorithm is to minimize

the sum of the squared distances between data points xi(j) that belong to cluster j with cluster center cj. In our case, we'll try to minimize the distance between gene expression vectors in each cluster and their centroids (vectors that define the cluster center or average).

K-Means Algorithm. Input a cluster count K and N genes each with a name and its own vector of M microarray measurements (the N genes are often referred to as rows and the M microarray conditions as experiments or columns). Then:

  1. Initialize K centroids (See "Progress steps and testing" below). In our case, a centroid is just the average of all genes in a cluster, and it can be represented just like a normal gene.
  2. Assign each gene to the cluster that has the closest centroid.
  3. After all genes have been assigned to clusters, recalculate the centroids for each cluster (as averages of all genes in the cluster).
  4. Repeat the gene assignments and centroid calculations until no change in gene assignment occurs between iterations.

Distance Measures. You will use two distance measures for clustering: Euclidean distance and Spearman rank correlation. Euclidean distance measures differences in the absolute levels of gene expression, whereas correlation compares the shapes of expression patterns. Furthermore, Spearman rank correlation uses ranks in place of absolute values, which makes it less sensitive to outliers (extremely high or low values) in the data.

Euclidean distance between two vectors x and y of equal length l can be computed as follows:

Spearman rank correlation between two vectors x and y of equal length l can be computed as follows:

where di is the difference between the corresponding ranks of the two vectors.

For example, suppose we have two vectors x = [4.6, 1.9, 2.4] and y = [2.2, 6.6, 5.5]. Then the rank transforms of these two vectors would be r = [3, 1, 2] and s = [1, 3, 2], making d = [3-1, 1-3, 2-2] = [2, -2, 0].

Note that these functions should return distance measures; that is, the returned value should be high if the two vectors are dissimilar, low if they are similar, and zero if they are completely identical. This requirement is already met for Euclidean distance, but Spearman rank correlation varies between -1 and 1, and high values indicate similarity. Therefore, you must transform the Spearman rank correlation so that the returned value is always greater than or equal to zero, with high values indicating dissimilarity.

Overview of the code. The code is divided into several classes to represent our real-world data. These classes are Gene.java, Cluster.java, and KMeans.java. We have provided you with a start on each of these classes. As you read the following descriptions, refer to the relevant code files to get a sense of the structure of the program and to figure out the code we provided.

We're keeping the Gene objects in each Cluster in a HashSet<Gene>. The size of our clusters will change as we add genes to the cluster with nearest centroid. The generic class HashSet is part of the Java language specification (i.e. the standard libraries). A HashSet is an unordered set storing only unique elements, which means no duplicate elements can be present in the hash set. To enforce this uniqueness property, HashSet requires the containing member type to define hashCode() and equals(Object o) for checking equivalence of elements in the set. Once these are defined, inserting an element that is already present in the HashSet should have no effect.

All standard Java data types (such as Integer, String) already have equals() and hashCode() defined, but not for user-defined classes such as Gene. For this assignment, you will need to define hashCode() and equals(Object o) for type Gene. Two Gene objects are considered equal if their gene names are equal. hashCode() returns an int that is the index of the "bin" a Gene object should be hashed to. You can assume hashCode() for Gene works in the same way as how hashCode() for gene name (of type String) works.

Another method of HashSet that you want to be familiar with is add(Gene o). The add(Gene o) method will take a Gene as an argument and add it to the hash set.

Iterating through a HashSet<Gene> can be done using:

for (Gene gene : geneSet) {
    System.out.println(gene.getName());
}
The full specification is available at http://download.oracle.com/javase/6/docs/api/java/util/HashSet.html.

We use the HashSet data structure to represent genes in a cluster, rather than array, for the reason of efficient equality checking between clusters, because the equality can be checked by finding the intersections of two HashSets. If the size of intersection is equal to the size of cluster, then the two clusters are equal. At each iteration, you need to check for equality between the cluster before iteration and the cluster after iteration, for every cluster in the list, and terminate the iterative algorithm only if all clusters do not change. You can check for equality of two sets by: s1.containsAll(s2) && s2.containsAll(s1).

Since this should be a general program that can cluster multiple microarray files with varying values of K and different distance metrics, we're going to use command line parameters to specify these values at runtime. The command format to run this program will be the following:

java KMeans <input_data_filename> <K> <metric> <output_filename>
where: We are providing a constructor of KMeans that loads in the data from a specified filename. You just have to parse in the four command line arguments from the usual args array.

Progress Steps and Testing. Once you know that the Genes are loading correctly, you should start to cluster. Initialize each of the cluster centroids to a random Gene as a starting value and proceed with the algorithm described above. After each iteration of creating new clusters, you need to compare the cluster content before and after the iteration to determine if you need to continue with another iteration. In order to test and debug this process, it may be useful to take one of the provided data files and considerably reduce the number of genes and/or experiments. Then you can use this smaller data file and print out information at each step of the iterative process to verify that the proper behavior is occurring.

Once you know that your program is clustering properly, have it write to the screen the names of each gene in each cluster, and create a .jpg for each cluster by calling createJPG(), passing in the <output_filename> and the cluster number as the arguments. In the end, when your code is invoked, it should read in a data file, perform the clustering, list the gene names on the screen, and create pictures of the expression levels of the genes in each cluster.

We have marked the places in the code where you have to fill in with // TODO.

Using your code. Now you are ready to use your program to cluster some real data. We have provided two microarray data files. One (human_cancer.pcl) is a human lung cancer data set, and the other (yeast_stress.pcl) is a yeast stress response data set. Run your code on both data sets with varying Ks using each distance metric. Inspect the results, and see what values of K and which distance metric are performing the best. We want you to turn in results (a text file with the gene names in each cluster and .jpgs of each cluster) with K = 5 for the cancer dataset and K = 7 for the yeast dataset using both distance metrics. Name the text files dataset_metric_K.txt, e.g., human_euclidean_5.txt. Name the cluster image files dataset_metric_K_clusternumber.jpg, e.g., yeast_spearman_7_1.jpg.

Initialization When initializing the kmeans algorithm, there're two methods interms of how to choose the intial centroids. You need two write both of these methods in your code:

  1. Randomly choose K genes as the initial centroids
  2. Use the first K genes in the dataset as the the initial centroids
When generating the graphs and text files for submission, please only use method #2 (first K genes as initial centroids) and comment out method #1 in your code. When testing your code on your computer, we encourage you to try both methods and see how the clustering results change, so that you have a sense of noisy biological data and how kmeans clustering depends on initial values.

Extra credit. If you'd like to get some extra credit (and see more of how microarray data is being analyzed in research labs today), read on. One problem with the clusters of genes you identified is that it is unclear which of the clusters is "better" (i.e. more biologically correct). Is Spearman or Euclidian-based clustering better?

We can try to evaluate the clusters by looking at the biological functions of known genes in each cluster and checking for statistical enrichment of specific biological functions. If a cluster has high statistical enrichment for genes with particular function, does that make you more or less confident in the fact that this cluster is biologically relevant?

For extra credit, examine one of your clusters for functional enrichment. Pick a cluster from the yeast dataset that changed a lot between the two distance metrics. Put the genes from the two versions of that cluster into the GoTermFinder tool at http://go.princeton.edu/cgi-bin/GOTermFinder. Make sure to check the Process option. You can ignore the other fields. This tool identifies significantly enriched functions in a group of genes. It uses the Gene Ontology, a vocabulary for known biological functions assigned to genes. Based on the results from GoTermFinder, fill out the appropriate section of the readme.txt template.

Required files to submit: