A Parallel Out-of-Core K-Means Clusterer
This is a re-implementation of the K-means clustering algorithm. I hate to
reinvent the wheel, but the freely available K-means clusterers I can find
online all lack one feature or another that I think might be essential to
allowing me to cluster my 50GB dataset into 1 million clusters within a few
days. After failing to figure out a clean way to run Mahout on my dataset, I decided to
give up searching the web and write my own clusterer. Following are a few
features that might make you interested in giving my clusterer a try (although I
don't think they really justify a re-implementation, because
parallelization and out-of-core processing are
needed together only when the dataset is too large for an ordinary clusterer to handle):
- It's implemented in C with low memory overhead.
- It allows/requires you to specify the amount of memory to use, and does out-of-core processing when the memory provided is smaller than the dataset.
- It does parallel reading from multiple files to improve throughput if they reside on multiple independent devices.
- It does pthread-based multi-threading and has virtually no software dependency other than the C compiler.
- However, if you do have MPI, it can be used to run on a cluster of machines (clustering in the other sense).
- If you do have a CBLAS library like ATLAS or Intel MKL, it can be used to make L2 distance calculation 5 times faster.
- It has an embedded kd-tree implementation to accelerate cluster center search when K is large (e.g. 1,000,000), although it cannot be used with CBLAS at the same time.
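
To give a rough idea of what out-of-core processing under a memory budget means, here is a minimal single-threaded sketch of one K-means iteration that streams the data from disk in chunks. It is not the code from the actual program: the function names (nearest_center, kmeans_pass), the raw-float file layout, and the brute-force center search are all assumptions made for illustration, and the threading, MPI, CBLAS and kd-tree parts are left out.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Index of the center nearest to point x, by squared L2 distance.
 * Brute force over all k centers (hypothetical helper). */
size_t nearest_center(const float *x, const float *centers,
                      size_t k, size_t dim)
{
    size_t best = 0;
    float best_d = FLT_MAX;
    for (size_t c = 0; c < k; c++) {
        float d = 0.0f;
        for (size_t j = 0; j < dim; j++) {
            float diff = x[j] - centers[c * dim + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}

/* One K-means iteration over a file assumed to hold raw floats,
 * dim values per point, one point after another. At most mem_bytes
 * of points are held in memory at a time. centers (k * dim floats)
 * is updated in place; centers that receive no points are left as is.
 * Returns the number of points read, or -1 if allocation fails. */
long kmeans_pass(FILE *fp, size_t dim, size_t k, float *centers,
                 size_t mem_bytes)
{
    assert(dim > 0 && k > 0);

    /* How many whole points fit in the budget (at least one). */
    size_t chunk_pts = mem_bytes / (dim * sizeof(float));
    if (chunk_pts == 0) chunk_pts = 1;

    float *buf = malloc(chunk_pts * dim * sizeof(float));
    double *sums = calloc(k * dim, sizeof(double));
    size_t *counts = calloc(k, sizeof(size_t));
    long total = -1;

    if (buf && sums && counts) {
        size_t got;
        total = 0;
        /* Stream the file chunk by chunk; only sums and counts persist. */
        while ((got = fread(buf, dim * sizeof(float), chunk_pts, fp)) > 0) {
            for (size_t i = 0; i < got; i++) {
                const float *x = buf + i * dim;
                size_t c = nearest_center(x, centers, k, dim);
                for (size_t j = 0; j < dim; j++)
                    sums[c * dim + j] += x[j];
                counts[c]++;
            }
            total += (long)got;
        }
        /* Move each non-empty center to the mean of its points. */
        for (size_t c = 0; c < k; c++) {
            if (counts[c] == 0) continue;
            for (size_t j = 0; j < dim; j++)
                centers[c * dim + j] = (float)(sums[c * dim + j] / counts[c]);
        }
    }
    free(buf);
    free(sums);
    free(counts);
    return total;
}
```

You would rewind the file and call kmeans_pass repeatedly until the centers stop moving. The point of the design is that the memory used for data is bounded by mem_bytes no matter how large the file is; only the per-center sums and counts have to stay in memory across chunks.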
You can download the source code here; there are some instructions on compiling and
running in the same file. Feel free to drop me an
email if you encounter any problems, or simply to say that the program works.