04-11
Making Big Data Analytics Interactive and Real-Time

The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early frameworks, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g., machine learning and graph processing), and (3) real-time processing on large data streams. In this talk, we present a single abstraction, resilient distributed datasets (RDDs), that supports all of these emerging workloads by providing efficient and fault-tolerant in-memory data sharing. We have used RDDs to build a stack of computing systems including the Spark parallel engine, Shark SQL processor, and Spark Streaming engine. Spark and Shark can run machine learning algorithms and interactive queries up to 100x faster than Hadoop MapReduce, while Spark Streaming enables fault-tolerant stream processing at significantly higher scales than were possible before. These systems, along with several resource allocation and scheduling algorithms we have developed along the way, have been used in multiple industry and research applications, and have a growing open-source community with 14 companies contributing in the past year.

Matei Zaharia is a PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in systems, cloud computing, and networking. He is also a committer on Apache Mesos and Apache Hadoop. His work is supported by a Google PhD fellowship. He received his undergraduate degree from the University of Waterloo in Canada.

Date and Time

Thursday April 11, 2013 4:30pm - 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Event Type

CS Department Colloquium Series

Speaker

Matei Zaharia, from University of California, Berkeley

Host

Michael Freedman

Contributions to and/or sponsorship of any event do not constitute departmental or institutional endorsement of the specific program, speakers, or views presented.
