
Computer Science 598F
Systems and Analytics of Big Data
Spring 2016
Projects
Some Systems Project Ideas
-
Design alternatives for large-scale photo stores
The goal of this potential project is to evaluate one or two
design alternatives to Facebook's Haystack photo store.
To balance workloads at Facebook, Haystack distributes photos
evenly across storage subsystems. As a result, the system does
not provide spatial locality for photos in the same album. One
possibility is to pretend that you are the chief architect of
the next-generation photo store at Facebook. Your job is
to come up with a new design that provides both load balancing
and spatial locality. Since it is difficult to obtain real
photo access traces, participants in this project will need to
model access patterns, create synthetic workloads based on the
current understanding of those patterns, and evaluate their
design alternatives using the synthetic workloads.
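
As a starting point for the synthetic workloads mentioned above, here is a minimal Python sketch. It assumes, purely for illustration, that album popularity follows a Zipf-like distribution and that each request reads a whole album; neither assumption comes from real Facebook traces. All function names are hypothetical. The sketch compares an even, album-oblivious placement with a placement that keeps each album on one node.

```python
import random
from collections import Counter

def make_photos(num_albums, photos_per_album):
    """Return a list of (album_id, photo_id) pairs with globally unique photo ids."""
    return [(a, a * photos_per_album + p)
            for a in range(num_albums) for p in range(photos_per_album)]

def zipf_album_trace(num_albums, num_requests, skew, rng):
    """Draw album ids with Zipf-like popularity (assumed): weight of rank r is 1 / r**skew."""
    weights = [1.0 / (rank ** skew) for rank in range(1, num_albums + 1)]
    return rng.choices(range(num_albums), weights=weights, k=num_requests)

def place_evenly(photos, num_nodes):
    """Spread photos evenly across nodes, ignoring albums (balanced, no locality)."""
    return {photo: photo % num_nodes for _, photo in photos}

def place_by_album(photos, num_nodes):
    """Keep every photo of an album on the same node (locality, possibly unbalanced)."""
    return {photo: album % num_nodes for album, photo in photos}

def evaluate(placement, photos, trace, num_nodes):
    """Replay the trace; report load imbalance (max / mean) and nodes touched per request."""
    by_album = {}
    for album, photo in photos:
        by_album.setdefault(album, []).append(photo)
    load = Counter()
    nodes_touched = 0
    for album in trace:
        nodes = [placement[p] for p in by_album[album]]
        load.update(nodes)                 # one read per photo in the album
        nodes_touched += len(set(nodes))   # fewer distinct nodes means better locality
    mean_load = sum(load.values()) / num_nodes
    return {
        "imbalance": max(load.values()) / mean_load,
        "nodes_per_request": nodes_touched / len(trace),
    }

if __name__ == "__main__":
    rng = random.Random(0)
    photos = make_photos(num_albums=100, photos_per_album=8)
    trace = zipf_album_trace(100, 10_000, skew=1.0, rng=rng)
    for name, place in (("even", place_evenly), ("by album", place_by_album)):
        print(name, evaluate(place(photos, 8), photos, trace, 8))
```

A new design would aim for an imbalance close to the even placement while touching close to one node per album request.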
-
A file system that allows dynamically attaching and
detaching storage devices
The motivation for this project is that network bandwidth
improves at the rate of 10x every decade, whereas data grows
along a Moore's-law-like curve, at the rate of 10x every 5
years. Cloud companies currently offer services that let
users ship portable media through the postal service. For
example, here is a quote from Amazon: "AWS Import/Export
accelerates moving large amounts of data into and out of the
AWS cloud using portable storage devices for transport.
AWS Import/Export transfers your data directly onto and off
the storage devices using Amazon's high-speed internal
network and bypassing the Internet. For significant
datasets, AWS Import/Export is often faster than Internet
transfer and more cost effective than upgrading your
connectivity." This is a semi-automatic but
inefficient process. What we want is a new file system that
allows storage devices (such as disk drives) to be attached
and detached dynamically. The design must be done carefully
to ensure security and privacy.
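
The growth rates quoted above imply that the time needed to move a dataset over the network keeps getting worse. The following Python sketch works through that arithmetic. The function names are hypothetical, and the 100 TB dataset and 1 Gb/s link in the usage lines are made-up figures for illustration only.

```python
def growth_factor(years, tenfold_period_years):
    """Multiplicative growth after `years` for a quantity that grows 10x every period."""
    return 10 ** (years / tenfold_period_years)

def transfer_time_ratio(years, data_period=5, bandwidth_period=10):
    """Factor by which network transfer time grows after `years`.

    Defaults follow the rates stated in the text: data grows 10x every
    5 years, bandwidth grows 10x every 10 years.
    """
    return growth_factor(years, data_period) / growth_factor(years, bandwidth_period)

def transfer_days(size_bytes, bandwidth_bits_per_second):
    """Days needed to move `size_bytes` over a link of the given speed."""
    seconds = size_bytes * 8 / bandwidth_bits_per_second
    return seconds / 86_400

if __name__ == "__main__":
    for years in (5, 10, 20):
        print(years, "years: transfer time grows by", transfer_time_ratio(years))
    # Hypothetical example: 100 TB over a 1 Gb/s link.
    print("days to transfer:", transfer_days(100e12, 1e9))
```

Under these rates the gap itself grows 10x per decade, which is why shipping storage devices becomes increasingly attractive.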
-
Tradeoffs of MPI and MapReduce paradigms
This project investigates the tradeoffs between MPI and
MapReduce. MapReduce is a parallel programming paradigm that
allows fine-grained fault tolerance, and it has been widely
used for many parallel applications in the cloud. MPI has been
a popular message-passing programming model in the HPC
community for the past two decades, but it typically provides
only coarse-grained fault tolerance such as checkpoint /
recovery. Can MPI programs be written using MapReduce
and still perform well if we consider the overheads of both
kinds of fault tolerance mechanisms? This project can explore
tradeoffs among performance, fault tolerance, and ease of
programming using a few kernels.
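
To make the fine-grained side of this comparison concrete, here is a toy, single-process Python sketch of a kernel written in map/reduce style, where a failed task is re-executed on its own rather than restarting the whole job from a checkpoint. It is not a real MapReduce or MPI program; the function names and the simulated-failure mechanism are assumptions made for illustration.

```python
from functools import reduce

def run_task_with_retry(task_id, chunk, map_fn, fails_on_first_try):
    """Run one map task; on a simulated crash, redo only this task."""
    attempts = 0
    while True:
        attempts += 1
        if attempts == 1 and task_id in fails_on_first_try:
            continue  # simulated failure: this task's work is lost and retried
        return [map_fn(x) for x in chunk], attempts

def map_reduce(data, num_tasks, map_fn, reduce_fn, initial,
               fails_on_first_try=frozenset()):
    """Split data into tasks, map each with retry, then reduce the partial results.

    Returns (result, total task attempts). The extra attempts are a crude
    proxy for the recovery overhead of fine-grained fault tolerance.
    """
    chunks = [data[i::num_tasks] for i in range(num_tasks)]
    total_attempts = 0
    partials = []
    for task_id, chunk in enumerate(chunks):
        mapped, attempts = run_task_with_retry(
            task_id, chunk, map_fn, fails_on_first_try)
        total_attempts += attempts
        partials.append(reduce(reduce_fn, mapped, initial))
    return reduce(reduce_fn, partials, initial), total_attempts

if __name__ == "__main__":
    # Sum-of-squares kernel with two of four tasks failing once.
    result, attempts = map_reduce(
        list(range(10)), 4, lambda x: x * x, lambda a, b: a + b, 0,
        fails_on_first_try={1, 3})
    print(result, attempts)
```

With checkpoint / recovery, a failure would instead cost all work done since the last checkpoint across every process, which is the overhead the project would measure against per-task re-execution.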
-
Convolution networks for high-dimensional data
Convolutional networks, and deep learning more broadly, have
gained a lot of traction since Hinton's group won the ImageNet
challenge by a large margin in 2012. However, convolutional
networks require a lot of computation to train. Most deep
learning studies have focused on 1D or 2D data such as images,
natural language text, and videos. In neuroscience
studies, human brain datasets are 3D volumes, and fMRI datasets
are 4D once the time dimension is included. If we want to use
convolutional networks with such datasets, an important
question is how to achieve a fast implementation.
This project can explore fast algorithms as well as
implementations on modern hardware such as manycore CPUs, GPUs,
and FPGAs.
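
To see why higher-dimensional inputs make speed a central concern, the Python sketch below counts the multiply-accumulate operations of one direct ("valid" mode, stride 1) convolution layer as the number of dimensions grows. The function names and the example shapes are assumptions chosen for illustration; they are not taken from any particular dataset.

```python
from math import prod

def conv_output_shape(input_shape, kernel_shape):
    """Output shape of a 'valid' convolution with stride 1 in every dimension."""
    if len(input_shape) != len(kernel_shape):
        raise ValueError("input and kernel must have the same number of dimensions")
    out = tuple(n - k + 1 for n, k in zip(input_shape, kernel_shape))
    if any(o <= 0 for o in out):
        raise ValueError("kernel is larger than the input")
    return out

def conv_macs(input_shape, kernel_shape, in_channels=1, out_channels=1):
    """Multiply-accumulates for a direct convolution: one kernel-sized dot
    product per output element, per input/output channel pair."""
    outputs = prod(conv_output_shape(input_shape, kernel_shape))
    return outputs * prod(kernel_shape) * in_channels * out_channels

if __name__ == "__main__":
    # Hypothetical sizes: 64 samples and a width-3 kernel along every dimension.
    for dims in (2, 3, 4):
        print(dims, "D:", conv_macs((64,) * dims, (3,) * dims))
```

Each added dimension multiplies the cost by roughly the input size times the kernel width, which motivates both faster algorithms and hardware-specific implementations.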
-
Deduplication file system for DNA sequence data
Genomic data is already big and will become much bigger in the
future. An interesting property of the human genome is that
the difference between the genome sequences of two people is
very small (about 1%). Another property is that the
human genome contains about 3 billion base pairs.
These two properties motivate the design and
implementation of a special-purpose, highly compressed file
system to store human DNA sequence data.
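
One way to exploit the small difference between genomes is to store each sequence as a list of differences against a shared reference. The Python sketch below illustrates the idea under a strong simplifying assumption: sequences have the same length as the reference and differ only by single-base substitutions. Real genomes also contain insertions and deletions, which this sketch does not handle. The function names are hypothetical.

```python
def encode_against_reference(reference, sequence):
    """Return the (position, base) pairs where `sequence` differs from `reference`."""
    if len(reference) != len(sequence):
        raise ValueError("this sketch only supports equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sequence)) if a != b]

def decode_from_reference(reference, diffs):
    """Rebuild the original sequence from the reference and its stored differences."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)

if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    person = "ACGTACCTACGA"
    diffs = encode_against_reference(reference, person)
    print(diffs)
    print(decode_from_reference(reference, diffs) == person)
```

A file system built on this idea would store the reference once and keep only the per-person differences, so storage grows with the number of differences rather than with the full genome length.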
Some Analytics Project Ideas
- Comparing different real-time fMRI data analysis
approaches
Neuroscientists have successfully
used the multi-voxel pattern analysis (MVPA) approach to
perform real-time fMRI data analysis for closed-loop
training of human attention in a recent study.
Another fMRI data analysis approach is
full correlation matrix analysis (FCMA), which requires
more computing but can analyze interactions among fMRI
brain voxels. A small project is to apply FCMA
to the fMRI datasets recorded in the
recent MVPA closed-loop study to answer some
interesting neuroscience questions, such as whether
FCMA yields the same results as MVPA in terms
of human subjects' attention states, and what
interactions occur between voxels in each attention state.
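
The core computation behind a full correlation matrix can be sketched in a few lines of Python. This is a minimal illustration that assumes Pearson correlation between voxel time series; it is not the FCMA toolbox, and the function names are hypothetical. The quadratic number of voxel pairs is the reason this approach needs more computing than MVPA.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a constant time series has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom

def full_correlation_matrix(time_series):
    """Correlate every voxel with every other voxel.

    `time_series[i]` is the signal of voxel i. The diagonal is set to 1.0
    directly; only the upper triangle is computed and then mirrored.
    """
    n = len(time_series)
    corr = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(time_series[i], time_series[j])
            corr[i][j] = corr[j][i] = r
    return corr

if __name__ == "__main__":
    voxels = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    for row in full_correlation_matrix(voxels):
        print(row)
```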
- Generating activity networks from full correlation
studies of fMRI datasets
Full correlation studies of fMRI data can reveal the
interactions among regions of the human brain. By
recording the links between interacting regions, we
can create an activity network of the human brain for a
particular dataset. The goal of this project is
to create and study the activity networks from about 20
datasets available at the Princeton Neuroscience
Institute. To achieve this goal, we can use the existing
FCMA toolbox. A simple way to study the activity
networks is to visualize or analyze them by task,
time, dataset, human subject, and so on.
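
One simple way to turn a correlation matrix into an activity network is to keep a link wherever the correlation is strong enough. The Python sketch below assumes a fixed threshold on the absolute correlation, which is an illustrative choice rather than a method prescribed by the FCMA toolbox. The function names and the example matrix are hypothetical.

```python
def activity_network(corr, threshold):
    """Return edges (i, j, r) with i < j for pairs whose |correlation| >= threshold."""
    n = len(corr)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i][j]) >= threshold:
                edges.append((i, j, corr[i][j]))
    return edges

def node_degrees(edges, num_nodes):
    """Count how many links touch each region, a basic network statistic."""
    degrees = [0] * num_nodes
    for i, j, _ in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees

if __name__ == "__main__":
    # Made-up 3-region correlation matrix for illustration.
    corr = [
        [1.0, 0.9, 0.1],
        [0.9, 1.0, -0.8],
        [0.1, -0.8, 1.0],
    ]
    edges = activity_network(corr, threshold=0.5)
    print(edges)
    print(node_degrees(edges, 3))
```

Networks built this way for different tasks, time windows, or subjects can then be compared by their edge lists or degree distributions.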
-
ISBI Challenge: Segmentation of neuronal structures
in EM stacks
Details are at
http://brainiac2.mit.edu/isbi_challenge/home
-
3D Segmentation of neurites in EM images
Details are at http://brainiac2.mit.edu/SNEMI3D/home.