UTILITY SCHEDULING FOR MULTI-TENANT CLUSTERS

Report ID: TR-005-19
Authors:
Date: May 10, 2019
Pages: 113
Download Formats: [PDF]

Abstract:

The rapid growth in data size, along with increasingly complex patterns of data usage among data scientists, presents new challenges for large-scale data analytics systems. Modern distributed computing frameworks must support applications that range from answering database queries to training machine learning models. As data centers have grown, managing their resources has become increasingly important, and the rising popularity of new classes of applications has made traditional scheduling systems inadequate.
In this thesis, we present distributed scheduling systems that increase cluster resource utilization by exploiting specific characteristics of data processing applications. First, we identify a set of applications whose characteristics make them prime targets for utility-based scheduling. We then focus on two such application types in the following systems:
(i) SLAQ: a cluster scheduling system for machine learning (ML) training jobs that aims to maximize the quality of all models being trained. In exploratory model training, overall quality improves more quickly when resources are redirected to the jobs with the highest potential for improvement. SLAQ reduces latency and maximizes the quality of the models trained on a shared distributed ML cluster.
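The quality-driven allocation idea behind SLAQ can be illustrated with a minimal greedy sketch. The function name, the per-core granularity, and the marginal-gain model below are hypothetical illustrations, not SLAQ's actual algorithm:

```python
def allocate_cores(jobs, total_cores):
    """Greedily hand out cores one at a time to the training job whose
    estimated quality gain from one more core is currently highest.

    jobs: {name: gain_fn}, where gain_fn(k) estimates the quality
    improvement from giving that job its (k+1)-th core.
    """
    shares = {name: 0 for name in jobs}
    for _ in range(total_cores):
        # Pick the job with the largest marginal gain at its current share.
        best = max(jobs, key=lambda name: jobs[name](shares[name]))
        shares[best] += 1
    return shares


# Two jobs with diminishing returns; the faster-improving job wins more cores.
jobs = {
    "job_a": lambda k: 1.0 / (k + 1),
    "job_b": lambda k: 0.4 / (k + 1),
}
print(allocate_cores(jobs, 5))  # → {'job_a': 4, 'job_b': 1}
```

Because the gain functions exhibit diminishing returns, this greedy loop keeps shifting capacity toward whichever job currently improves fastest, which is the intuition the abstract describes.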
(ii) ReLAQS: a cluster scheduling system for incremental approximate query processing (AQP) that aims to minimize the error of all approximate results. In AQP, queries compute approximate answers by sampling data, and error can be reduced more quickly by allocating resources to the queries with the highest remaining error. ReLAQS reduces the latency required to reach a query result with a given error level in a shared AQP environment.
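The error-driven allocation that ReLAQS performs can likewise be sketched as a simple error-proportional split. This is a hypothetical illustration of the principle, not ReLAQS's actual policy:

```python
def error_weighted_shares(errors, total_cores):
    """Allocate cores roughly in proportion to each query's current error,
    so that high-error queries receive more resources and converge faster.

    errors: {query_name: current_error_estimate}
    """
    total_error = sum(errors.values())
    shares = {q: int(total_cores * e / total_error) for q, e in errors.items()}
    # Hand any leftover cores (from rounding down) to the highest-error queries.
    leftover = total_cores - sum(shares.values())
    for q in sorted(errors, key=errors.get, reverse=True)[:leftover]:
        shares[q] += 1
    return shares


errors = {"q1": 0.6, "q2": 0.3, "q3": 0.1}
print(error_weighted_shares(errors, 10))
```

As queries run and their error estimates shrink at different rates, rerunning this split at each scheduling interval keeps shifting capacity toward the queries that are furthest from their target error.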
These works demonstrate a novel set of methods for fine-grained scheduling that can be used to build responsive, efficient distributed systems. We evaluated these systems on standard benchmark workloads and datasets, as well as on popular ML algorithms, and show both reduced latency and increased accuracy of intermediate results.
