I am a 5th-year Ph.D. student in the Department of Computer Science at Princeton University. My advisor is Professor Michael J. Freedman.

My research focuses on designing scalable, reliable, and manageable distributed systems. Topics include resource management for big-data analytics and machine learning, fault-tolerant network architecture, and strongly consistent, ultrafast key-value services built on programmable networks (P4).

Previously, I interned at Facebook (summer 2017) and Microsoft Research (summer 2015). Before coming to Princeton, I received my B.Sc. in Computer Science from the School of Electronics Engineering and Computer Science at Peking University in Beijing. I also received my B.A. in Economics (double major) from the National School of Development.

Email: haoyuz (at) cs (dot) princeton (dot) edu

[Bio] [CV]


Princeton University, S* Network Systems (SNS) Group

Research Assistant, 09/2013--Present
Advisor: Mike Freedman

Facebook, Menlo Park

Ph.D. Software Engineer Intern, 06/2017--08/2017
Mentors: Brian Cho, Ergin Seyfe

Microsoft Research, Redmond

Research Intern, 06/2015--08/2015
Mentors: Ganesh Ananthanarayanan, Peter Bodik

Princeton Computer Science Department

Teaching Assistant, 09/2014--05/2015
CS 217 Introduction to Programming Systems

Technion---Israel Institute of Technology

Research Exchange Student, 09/2012--02/2013
Advisor: Daniel A. Freedman

Peking University, Institute of Network Computing and Information Systems

Undergrad Member, 09/2011--09/2012
Advisor: Zhen Xiao

Riffle: Optimized Shuffle Service for Large-Scale Data Analytics [PDF] [Slides] [BibTeX]

Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, Michael J. Freedman

Shuffle operations become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs, because the number of disk I/O operations grows superlinearly with data volume. Riffle is an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. Using Riffle, Facebook's production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and a 40% improvement in end-to-end job completion time.

European Conference on Computer Systems (EuroSys '18). Porto, Portugal



NetChain: Scale-Free Sub-RTT Coordination [PDF] [BibTeX]

Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soule, Changhoon Kim, Ion Stoica

Coordination services are a fundamental building block of modern cloud systems. NetChain provides scale-free sub-RTT coordination by exploiting recent advances in programmable switches to store data and process queries entirely in the network data plane. Evaluation results show that our system provides orders of magnitude higher throughput and lower latency, and handles failures gracefully.

USENIX Symposium on Networked Systems Design and Implementation (NSDI '18). Renton, WA, USA

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning [PDF] [Slides] [BibTeX]

Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman

Training machine learning models with large datasets can incur significant resource contention on shared clusters. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. SLAQ is a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. Experiments show that SLAQ achieves a quality improvement of up to 73% and a delay reduction of up to 44%.

ACM Symposium on Cloud Computing (SoCC '17). Santa Clara, CA, USA

Best Paper Award

Poster at the 1st SysML Conference (SysML '18). Stanford, CA, USA



NetCache: Balancing Key-Value Stores with Fast In-Network Caching [PDF] [BibTeX]

Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soule, Nate Foster, Jeongkeun Lee, Changhoon Kim, Ion Stoica

NetCache is a new key-value store architecture that leverages the power and flexibility of P4 switches to handle queries on hot items and balance the load across storage nodes. The core is a packet-processing pipeline that exploits the capabilities of modern programmable switch ASICs to efficiently detect, index, cache and serve hot key-value items in the switch data plane. We implement a NetCache prototype on Barefoot Tofino switches and demonstrate that a single switch can process 2+ billion queries per second.

ACM Symposium on Operating Systems Principles (SOSP '17). Shanghai, China

Live Video Analytics at Scale with Approximation and Delay-Tolerance [PDF] [Slides] [BibTeX]

Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, Michael J. Freedman

VideoStorm is a video analytics system that processes thousands of vision analytics queries on live video streams over large clusters. An offline profiler generates a resource-quality profile for each query, and an online scheduler allocates resources to queries to maximize video processing performance. Deployment on an Azure cluster of a hundred machines, processing video from operational traffic cameras, shows an improvement of as much as 80% in the quality of real-world queries and 7x better lag.

USENIX Symposium on Networked Systems Design and Implementation (NSDI '17). Boston, MA, USA


Ravana: Controller Fault-Tolerance in Software-Defined Networking [PDF] [BibTeX]

Naga Katta, Haoyu Zhang, Michael J. Freedman, Jennifer Rexford

Ravana is a fault-tolerant SDN controller platform that processes control messages transactionally and exactly once at both the controllers and the switches. The protocol guarantees strong consistency across controller replicas during controller and switch failures by extending replicated state machines with lightweight switch-side mechanisms. Ravana enables unmodified controller applications to execute in a fault-tolerant fashion.

ACM Symposium on SDN Research (SOSR '15). Santa Clara, CA, USA

I am a Chinese calligraphy enthusiast and enjoy handwriting with ink brushes in Liu Gongquan's style. I also like magic, and was a member of the PKU Magicians' Club as an undergrad.

I served as Publicity Chair for the Association of Chinese Students and Scholars at Princeton University (ACSSPU) and the Princeton Association of Chinese Entrepreneurs (PACE) in 2014--2015.

Email: haoyuz (at) cs (dot) princeton (dot) edu
Phone: (609) 258 [tu: θri: nain eit]
Address: Department of Computer Science
Princeton University
35 Olden Street, #318B
Princeton, NJ 08540-5233