Princeton University
Computer Science Department

Computer Science 598F
Systems and Analytics of Big Data

 

Spring 2016

 

 




Some Systems Project Ideas

  1. Design alternatives for large-scale photo stores

    The goal of this project is to evaluate one or two design alternatives to Facebook's Haystack photo store.  To balance workloads at Facebook, Haystack distributes photos evenly across storage subsystems.  As a result, the system does not provide spatial locality for photos in the same album.  One way to approach the project is to pretend that you are the chief architect of the next-generation photo store at Facebook.  Your job is to come up with a new design that provides both load balancing and spatial locality.  Since real photo access traces are difficult to obtain, participants will need to model access patterns, create synthetic workloads based on the current understanding of those patterns, and evaluate their design alternatives using the synthetic workloads.
  2. A file system that allows dynamically attaching and detaching storage devices

    The motivation for this project is that network bandwidth improves at the rate of 10x every decade, whereas data grows along the Moore's law curve, at the rate of 10x every 5 years.  Cloud companies currently offer services that let users ship portable media through postal services.  For example, here is a quote from Amazon: "AWS Import/Export accelerates moving large amounts of data into and out of the AWS cloud using portable storage devices for transport.  AWS Import/Export transfers your data directly onto and off the storage devices using Amazon's high-speed internal network and bypassing the Internet.  For significant datasets, AWS Import/Export is often faster than Internet transfer and more cost effective than upgrading your connectivity."  This is a semi-automatic but inefficient process.  What we want is a new file system that allows storage devices (such as disk drives) to be attached and detached dynamically, designed carefully to ensure security and privacy.
  3. Tradeoffs of MPI and MapReduce paradigms

    This project investigates the tradeoffs between MPI and MapReduce.  MapReduce is a parallel programming paradigm that allows fine-grained fault tolerance, and it has been widely used for many parallel applications in the cloud.  MPI has been a popular message-passing programming model in the HPC community for the past two decades, but it typically provides only coarse-grained fault tolerance such as checkpoint/recovery.  Can MPI programs be written using MapReduce and still perform well once the overheads of both kinds of fault-tolerance mechanisms are taken into account?  This project can explore the tradeoffs among performance, fault tolerance, and ease of programming using a few kernels.
  4. Convolutional networks for high-dimensional data

    Convolutional networks, and deep learning more generally, have gained a lot of traction since Hinton's group won the ImageNet challenge by a large margin in 2012.  However, convolutional networks require a lot of computation for training.  Most deep learning studies have focused on 1D or 2D data such as images, natural language text, and videos.  In neuroscience studies, human brain datasets are 3D volumes, and fMRI datasets are 4D once the time dimension is included.  If we want to use convolutional networks with such datasets, an important question is how to achieve a fast implementation.  This project can explore fast algorithms as well as implementations on modern hardware such as manycore CPUs, GPUs, and FPGAs.
  5. Deduplication file system for DNA sequence data

    Genomic data is already big and will become much bigger in the future.  An interesting property of the human genome is that the difference between the genome sequences of two people is very small (about 1%).  Another property is that the human genome contains about 3 billion base pairs.  Together, the two properties motivate the design and implementation of a special-purpose, highly compressed file system for storing human DNA sequence data.
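
One possible starting point for the DNA deduplication idea (project 5) is reference-based delta encoding: keep one reference sequence and, for each person, store only the positions that differ from it.  The sketch below is a minimal illustration under strong assumptions: sequences have the same length as the reference and differ only by single-base substitutions (real genomes also contain insertions and deletions), and the function names are hypothetical rather than taken from any existing tool.

```python
from typing import List, Tuple

# A delta is a list of (position, base) pairs where the sequence
# differs from the reference.
Delta = List[Tuple[int, str]]


def encode_against_reference(reference: str, sequence: str) -> Delta:
    """Record only the positions where `sequence` differs from `reference`."""
    if len(reference) != len(sequence):
        # Assumption of this sketch: substitutions only, so lengths match.
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (pos, base)
        for pos, (ref_base, base) in enumerate(zip(reference, sequence))
        if ref_base != base
    ]


def decode_against_reference(reference: str, delta: Delta) -> str:
    """Rebuild the original sequence from the reference plus its delta."""
    bases = list(reference)
    for pos, base in delta:
        bases[pos] = base
    return "".join(bases)


# Toy data for illustration only (not real genomic sequences).
reference = "ACGTACGTACGTACGTACGT"
individual = "ACGTACGAACGTACGTACCT"

delta = encode_against_reference(reference, individual)
restored = decode_against_reference(reference, delta)
```

Here the 20-base sequence is represented by two (position, base) pairs plus the shared reference.  A real design would also need to handle insertions and deletions, choose a compact on-disk encoding for the deltas, and decide how the file system exposes the reconstructed data.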

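For the photo store idea (project 1), the tension between load balancing and spatial locality can be explored with a small simulation before building anything.  The sketch below compares two hypothetical placement policies on a synthetic workload: hashing each photo independently (even spread, no album locality) and hashing by album (full locality, possibly uneven load).  The policies, the Pareto-distributed album sizes, the node count, and the two metrics are all assumptions made for illustration; none of them describes how Haystack works.

```python
import random
import zlib
from collections import Counter
from typing import Callable, List, Tuple

Policy = Callable[[int, int, int], int]


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in str hash varies between runs)."""
    return zlib.crc32(key.encode("utf-8"))


def place_by_photo(album_id: int, photo_id: int, num_nodes: int) -> int:
    """Hypothetical policy A: spread individual photos across nodes."""
    return stable_hash(f"photo:{album_id}:{photo_id}") % num_nodes


def place_by_album(album_id: int, photo_id: int, num_nodes: int) -> int:
    """Hypothetical policy B: keep every photo of an album on one node."""
    return stable_hash(f"album:{album_id}") % num_nodes


def evaluate(policy: Policy, album_sizes: List[int],
             num_nodes: int) -> Tuple[float, float]:
    """Return (load imbalance, average number of nodes touched per album)."""
    load = Counter()
    nodes_per_album = []
    for album_id, size in enumerate(album_sizes):
        touched = set()
        for photo_id in range(size):
            node = policy(album_id, photo_id, num_nodes)
            load[node] += 1
            touched.add(node)
        nodes_per_album.append(len(touched))
    mean_load = sum(album_sizes) / num_nodes
    imbalance = max(load.values()) / mean_load  # 1.0 means perfectly even
    spread = sum(nodes_per_album) / len(nodes_per_album)
    return imbalance, spread


# Synthetic workload: heavy-tailed album sizes (an assumption, not trace data).
rng = random.Random(0)
album_sizes = [int(rng.paretovariate(1.5) * 10) for _ in range(2000)]
NUM_NODES = 16

photo_imbalance, photo_spread = evaluate(place_by_photo, album_sizes, NUM_NODES)
album_imbalance, album_spread = evaluate(place_by_album, album_sizes, NUM_NODES)
print(f"per-photo placement: imbalance={photo_imbalance:.3f}, nodes/album={photo_spread:.2f}")
print(f"per-album placement: imbalance={album_imbalance:.3f}, nodes/album={album_spread:.2f}")
```

A design alternative for the project would aim for an imbalance close to the per-photo policy while keeping the number of nodes touched per album close to the per-album policy.  The same harness could be extended with a model of access patterns rather than storage volume alone.
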
Some Analytics Project Ideas

  1. Comparing different real-time fMRI data analysis approaches

    Neuroscientists have successfully used the multi-voxel pattern analysis (MVPA) approach to perform real-time fMRI data analysis for closed-loop training of human attention in a recent study.  Another fMRI data analysis approach is full-correlation matrix analysis (FCMA), which requires more computing but can analyze interactions among fMRI brain voxels.  A small project is to apply FCMA to the fMRI datasets recorded in the recent MVPA closed-loop study to answer some interesting neuroscience questions, such as whether FCMA yields the same results as MVPA in terms of the human subjects' attention states, and what interactions occur between voxels in each attention state.

  2. Generating activity networks from full correlation studies of fMRI datasets

    Full correlation studies of fMRI data can reveal the interactions among regions of a human brain.  By recording the links of such interaction activities, we can create an activity network of human brains in a particular dataset.  The goal of this project is to create and study the activity networks from about 20 available datasets at the Princeton Neuroscience Institute.  To achieve this goal, we can use the existing FCMA toolbox.  A simple way to study the activity networks is to visualize or analyze them based on tasks, time, datasets, human subjects, and so on.
  3. ISBI Challenge: Segmentation of neuronal structures in EM stacks

    Details are at http://brainiac2.mit.edu/isbi_challenge/home

  4. 3D Segmentation of neurites in EM images

    Details are at http://brainiac2.mit.edu/SNEMI3D/home.
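
For the FCMA-based ideas (analytics projects 1 and 2), it may help to see what a full correlation matrix is at its simplest: the Pearson correlation between the time series of every pair of voxels.  The sketch below is a plain-Python illustration of that definition on toy data.  It is not the FCMA toolbox's API, and the function names are hypothetical; a practical implementation would rely on optimized matrix multiplication.

```python
import math
from typing import List


def center_and_normalize(series: List[float]) -> List[float]:
    """Subtract the mean and scale to unit length, so that the dot product
    of two processed series equals their Pearson correlation."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant time series")
    return [c / norm for c in centered]


def full_correlation_matrix(voxels: List[List[float]]) -> List[List[float]]:
    """Correlate every voxel's time series with every other voxel's."""
    normalized = [center_and_normalize(v) for v in voxels]
    n = len(normalized)
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            r = sum(a * b for a, b in zip(normalized[i], normalized[j]))
            corr[i][j] = r
            corr[j][i] = r  # the matrix is symmetric
    return corr


# Toy data: three "voxels", four time points each (not real fMRI data).
voxels = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # rises with voxel 0
    [4.0, 3.0, 2.0, 1.0],   # falls as voxel 0 rises
]
corr = full_correlation_matrix(voxels)
```

With N voxels there are N(N-1)/2 distinct pairs, which is one way to see why the full-correlation approach needs more computing than analyses that treat voxels independently.  Thresholding the entries of such a matrix is one simple, assumed way to derive the links of an activity network.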