My research focuses on designing scalable, reliable, and manageable distributed systems. Topics include resource management for big-data analytics and machine learning, fault-tolerant network architecture, and strongly consistent, ultrafast key-value services built on programmable networks (P4).
Previously, I interned at Facebook (summer 2017) and Microsoft Research (summer 2015). Before coming to Princeton, I received my B.Sc. in Computer Science from the School of Electronics Engineering and Computer Science at Peking University in Beijing. I also received my B.A. in Economics (double major) from the National School of Development.
Teaching Assistant, 09/2014--05/2015
CS 217 Introduction to Programming Systems
Research Exchange Student, 09/2012--02/2013
Advisor: Daniel A. Freedman
Undergrad Member, 09/2011--09/2012
Advisor: Zhen Xiao
Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, Michael J. Freedman
Shuffle operations become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs, due to the superlinear increase in disk I/O operations as data volume increases. Riffle is an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. Using Riffle, Facebook's production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and a 40% improvement in end-to-end job completion time.
European Conference on Computer Systems (EuroSys '18). Porto, Portugal
Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soule, Changhoon Kim, Ion Stoica
Coordination services are a fundamental building block of modern cloud systems. NetChain provides scale-free sub-RTT coordination by exploiting recent advances in programmable switches to store data and process queries entirely in the network data plane. Evaluation results show that our system provides orders of magnitude higher throughput and lower latency, and handles failures gracefully.
USENIX Symposium on Networked Systems Design and Implementation (NSDI '18). Renton, WA, USA
Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman
Training machine learning models with large datasets can incur significant resource contention on shared clusters. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. SLAQ is a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. Experiments show that SLAQ achieves a quality improvement of up to 73% and a delay reduction of up to 44%.
ACM Symposium on Cloud Computing (SoCC '17). Santa Clara, CA, USA
Best Paper Award
Poster at the 1st SysML Conference (SysML '18). Stanford, CA, USA
Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soule, Nate Foster, Jeongkeun Lee, Changhoon Kim, Ion Stoica
NetCache is a new key-value store architecture that leverages the power and flexibility of P4 switches to handle queries on hot items and balance the load across storage nodes. The core is a packet-processing pipeline that exploits the capabilities of modern programmable switch ASICs to efficiently detect, index, cache and serve hot key-value items in the switch data plane. We implement a NetCache prototype on Barefoot Tofino switches and demonstrate that a single switch can process 2+ billion queries per second.
ACM Symposium on Operating Systems Principles (SOSP '17). Shanghai, China
Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, Michael J. Freedman
VideoStorm is a video analytics system that processes thousands of vision analytics queries on live video streams over large clusters. An offline profiler generates resource-quality profiles for queries, and an online scheduler allocates resources to queries to maximize video processing performance. Deployment on an Azure cluster of a hundred machines, processing video from operational traffic cameras, shows an improvement of as much as 80% in the quality of real-world queries and 7x better lag.
USENIX Symposium on Networked Systems Design and Implementation (NSDI '17). Boston, MA, USA
Naga Katta, Haoyu Zhang, Michael J. Freedman, Jennifer Rexford
Ravana is a fault-tolerant SDN controller platform that processes control messages transactionally and exactly once at both the controllers and the switches. The protocol guarantees strong consistency across controller replicas during controller and switch failures by extending replicated state machines with lightweight switch-side mechanisms. Ravana enables unmodified controller applications to execute in a fault-tolerant fashion.
ACM Symposium on SDN Research (SOSR '15). Santa Clara, CA, USA
I served as Publicity Chair for the Association of Chinese Students and Scholars at Princeton University (ACSSPU) and the Princeton Association of Chinese Entrepreneurs (PACE) in 2014--2015.
Email: haoyuz (at) cs (dot) princeton (dot) edu
Phone: (609) 258 [tu: θri: nain eit]
Address: Department of Computer Science, 35 Olden Street, #318B, Princeton, NJ 08540-5233