Princeton University
Computer Science Department

Computer Science 598C
Analytics and Systems of Big Data

 

Spring 2013

 

 





Tentative Syllabus

(All presentations are available on this course’s Blackboard site.)

 

Week 1

2/4: Organizational meeting (digital universe)

2/8: MapReduce, datasets, and project ideas (Google paper, Dean’s keynote, Chapter 2 of MMDS, and a critical blog post that has since been removed)

 

Warm-up exercise:

Follow these instructions to access the C8 cluster. Modify the given WordCount example (source code is at HDFS:/user/fuse/WordCount.java) so that your code cleans up the corpus documents (HDFS:/user/fuse/essays) and outputs the word frequencies in sorted order. Submit your final program to the department dropbox (which we will set up shortly). Due: 2/22.
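The logic of the exercise can be prototyped locally before porting it to the Hadoop WordCount job. The following is a minimal sketch in plain Python, not the Java/Hadoop code to be submitted. It assumes that "cleaning" means lowercasing and dropping punctuation and digits, and that "sorted order" means descending frequency with alphabetical tie-breaking; both are assumptions, since the exercise does not specify them. The function names map_phase, reduce_phase, and sorted_frequencies are hypothetical.

```python
import re
from collections import Counter
from itertools import chain

# Assumed cleaning rule: keep runs of ASCII letters, allowing one
# internal apostrophe (e.g. "cat's"); everything else is discarded.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def map_phase(document):
    """Mapper: clean one document and emit (word, 1) pairs."""
    for word in TOKEN.findall(document.lower()):
        yield word, 1


def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def sorted_frequencies(documents):
    """Run map and reduce locally, then sort the result.

    Assumed ordering: descending count, ties broken alphabetically.
    """
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    totals = reduce_phase(pairs)
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    docs = ["The cat sat; the CAT ran!", "A cat's toy."]
    for word, count in sorted_frequencies(docs):
        print(f"{word}\t{count}")
```

In a real Hadoop job the sort by frequency usually needs a second pass (or a second MapReduce job keyed on the count), because the framework sorts reducer input by key, which is the word in the first job.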

 

Week 2 (2/15): Deep learning and GFS

Linpeng Tang (reading review, secondary high-level structure, ICML’12)

Sharvanath Pathak (reading GFS, SOSP’03, secondary Chubby, OSDI’06)

Submit reading notes

 

Week 3 (2/22):

Guest lecture by Dr. Phillip Shilane from EMC (reading acceleration for backup, FAST’12)

Brian Tubergen (reading protocol independent dedup, SIGCOMM’00, secondary network dedup acceleration)

Submit reading notes

Submit warm-up exercise

 

2/24

Submit project proposals

 

Week 4 (3/1):

Guest lecture by Dr. Ruoming Pang from Google (reading Google’s globally distributed DB, OSDI’12)

Mehmet Basbug (Primary: LSH (VLDB’99), Secondary: Multi-probe (VLDB’07))

Guest: Dr. William Hanson (CMO of UPenn Hospital) on medical data and projects

Submit reading notes

 

Week 5 (3/8):

Madhuvanthi Jayakumar (Primary: Google’s globally distributed storage OSDI’10)

Nayden Nedev (Primary: Google Data center network, CACM’12,  SDN)

Christian Edbank (Primary: KNN-Descent, WWW’11, Secondary: 7.1-7.3 of MMDS)

Submit reading notes

 

Week 6 (3/15)

Akshay Mittal (Primary: RAMCloud, Secondary: Spark)

Srinivas Narayana (Primary: MMDS 4.1-4.4, Secondary: OpenSketch)

Andrew Werner (Primary: Spanner OSDI’12, secondary Paxos)

Submit reading notes

 

Week 7 (3/29):

Guest lecture by Dr. Sanjeev Kumar from Facebook (reading Facebook’s photo storage, OSDI’10)

Xiao Li (Primary: Amazon’s key-value store SOSP’07, secondary SILT SOSP’11).

Submit reading notes

                            

Submit project mid-term progress reports; short presentations in class

 

Week 8 (4/5)

Eric First (Primary: MMDS 3.5, Secondary: graph similarity)

Wathsala Wathawana (Primary: layered naming, Secondary: Chord)

Muneeb Ali (Primary: Kahn’s paper, Secondary: Lampson’s paper)

Submit reading notes

 

Week 9 (4/12)

Mike Mckeown (Primary: energy-proportional computing; Secondary: energy-efficient MapReduce)

Trevor Bannard (Primary: ImageNet construction; Secondary: classification with 10,000 categories)

Robert Sami (Primary: 9.1-9.2 MMDS book; Secondary: 9.3 MMDS book)

Submit reading notes

 

Week 10 (4/19)

Sachin Ravi (Primary: external hashing, secondary: unpublished)

Marcela Melara (Primary: info leak in cloud; secondary: attack-resource)

Tri Nguyen (Primary: energy-proportional storage; secondary: manycore key-value store)

Submit reading notes

 

Week 11 (4/26)

Guest lecture by Prof. Moses Charikar (various topics)

Alp Kutlualp (Primary: chapter 11.1 and 11.2, Secondary: chapter 11.3; http://i.stanford.edu/~ullman/pub/ch11.pdf)

Submit reading notes

 

Last class meeting (5/17)

All: Project demos/presentations (each 15 minutes)

 

5/19

Submit project report (due)

 

 

 

Suggested book

 

Mining of Massive Datasets. Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman. Cambridge University Press. 2011.

You can download the latest version of the book from an author’s webpage.

 

Systems Topics

           

MapReduce abstraction

Google paper, Dean’s keynote, Chapter 2 of MMDS, a critical blog post (removed but interesting for discussion)

 

Google systems

GFS (SOSP’03), BigTable (OSDI’06), Google cluster (IEEE’03), Google Data center network (CACM’12)

 

Key-value stores

Amazon’s key-value store (SOSP’07),  SILT (SOSP’11)

 

Distributed storage systems

Facebook’s photo storage (OSDI’10),  Google’s globally distributed storage (OSDI’10), Google’s globally distributed DB (OSDI’12), Microsoft’s Azure storage (USENIX’12).

           

Deduplication storage systems

Venti (FAST’02), Data Domain DDFS (FAST’08), others (TBD).

 

Deduplication WAN bandwidth optimization

WAN protocol independent dedup (SIGCOMM’00), LBFS (SOSP’01), WAN acceleration for backup (FAST’12), others (TBD).

 

Analytics topics

 

Domain specific feature extractions

Image SIFT (ICCV’99), Audio MFCC ( ), others (TBD)

 

Unsupervised feature learning and deep learning

review, high-level structure (ICML’12), others (TBD)

 

Ontology

ImageNet (CVPR’09, ECCV’10), others (TBD)

 

Similarity measures

Section 3.5 of MMDS, papers (TBD)

 

Shingles and minhashing

Sections 3.2-3.3 of MMDS, Fingerprinting, Document resemblance

 

Locality sensitive hashing

Section 3.4 of MMDS, LSH (VLDB’99), Multi-probe (VLDB’07), Posteriori Multi-probe, …

 

Dimension reduction

Curse of dimensionality (TBD), PCA (TBD), Sketches (TBD), others (TBD)

           

Streaming

Sampling (Sections 4.1 and 4.2 of MMDS), Bloom filter (Section 4.3 of MMDS), Counting (Section 4.4 of MMDS), others (TBD)

 

Processing large data in memory

Section 6.3 of MMDS

 

Clustering in high-dimensional space

K-means (Section 7.3 of MMDS), KNN-Descent (WWW’11), others (TBD)

 

Web link analysis

Parts of Chapter 5 of MMDS

 

Graph search

Parts of Chapter 10 of MMDS