Princeton University
Computer Science Department

Computer Science 493
Algorithms for Massive Data Sets

Moses Charikar

Spring 2002


General Information | Schedule and Readings | What's New?

Course Description

The course deals with algorithmic tools and techniques for the organization, manipulation and processing of large amounts of data. With the current information explosion, several applications require the manipulation of very large data sets that exceed the size of the main memory of a typical processor. In some cases, the amount of data is so large that we may only make a few passes over it, or cannot examine all of it. Certain problems that arise in this setting have been studied classically, e.g. How do we compress large files ? How do we deal with bit errors that are inevitable in the storage and transmission of large amounts of data ? On the other hand, recent research has examined many of the newer issues that arise in this context, e.g. Can we design very efficient sublinear algorithms for interesting computational tasks ? What can be computed in a single pass over the data ? How can we deal with high dimensional data efficiently ? How do we deal with web data, e.g. assess the relative importance of web pages ? The course will cover a combination of classical topics as well as a selection of some of the interesting ideas that have arisen in recent research. 

The first part covers classical topics such as

compression (Huffman and arithmetic coding, Lempel-Ziv, Burrows Wheeler),
error-correcting codes (Reed Solomon and Tornado codes), 
indexing (inverted files, signature files, bitmaps), 
clustering, classification and learning (naive Bayes, support vector machines, k-means). 

The second part covers new algorithms and techniques such as 

streaming algorithms, 
sketching algorithms for estimating similarity, 
dimension reduction, 
nearest neighbor search, 
link analysis (Pagerank, hubs and authorities), 
singular value decomposition, and 
models for the web.

Students will be expected to have a basic knowledge of algorithms and discrete mathematics. 
Prerequisites: COS 226 and COS 341 or permission of instructor. Recommended: COS 423


Problem sets will be handed out in class every two weeks. All students will be expected to scribe lecture notes for one or two classes during the semester. The grade will be based on performance in the problem sets, quality of the scribe notes and class participation. There will be no midterm or final exam.

Syllabus (tentative)

Week 1: Compression: Models, Information Theory, Huffman coding and Arithmetic coding

Week 2: Lempel-Ziv algorithms, Burrows Wheeler, Lossy compression

Week 3: Indexing: inverted file compression and models, signature files and bitmaps

Week 4: Error correcting codes, Reed Solomon codes

Week 5: Decoding Reed Solomon codes, Tornado codes

Week 6: Classification and clustering: Naive Bayes, support vector machines and k-means.

Week 7: Streaming Algorithms: reservoir sampling, estimating distinct values and frequency moments

Week 8: Sketching algorithms for estimating similarity and applications to clustering.

Week 9: Dimension reduction and nearest neighbor algorithms.

Week 10: Link analysis: Pagerank and Kleinberg's Hubs and Authorities

Week 11: Singular Value Decomposition and models for the web.

Week 12: Web characterization, power laws and research problems.


Administrative Information

Lectures: TTh 1:30-2:50 , Room: 302  

Professor: Moses Charikar - 305 CS Building - 258-7477

Undergraduate Coordinator: Tina McCoy - 410 CS Building - 258-1746

Course Secretary: Mitra Kelly - 323 CS Building - 258-4562

Mailing List: send mail to with "subscribe cs493" in the message body to subscribe.

Teaching Assistants: TBA


Last updated by Moses Charikar, 07-Feb-2003 12:57 AM