Rohan Baskar Prabhakar will present his FPO "Hardware-Aware Software Optimizations for and with Machine Learning" on Monday, May 4, 2026 at 10am in CS 302.

 
His committee is as follows:
Examiners: David Wentzlaff (adviser), Aarti Gupta, and Ravi Netravali
Readers: Jialin Ding and Prateek Mittal
 
Abstract:
In recent years, fundamental limits in semiconductor manufacturing have caused a gradual decline in the steady cadence of hardware performance scaling. With pivotal trends like Moore's law and Dennard scaling ending, there is a growing need to ensure that software workloads execute at peak efficiency. Towards this objective, this dissertation describes three hardware-aware optimizations that either accelerate machine learning inference or use machine learning to optimize single-threaded CPU workloads.
 
First, this dissertation presents Kraken, a variation of the Transformer architecture designed to improve the efficiency of tensor parallelism during inference. The new model architecture incorporates an innate notion of model parallelism that complements the topology of multi-device inference hardware and allows communication to overlap with compute. Experiments demonstrate that, while preserving the language modeling performance of standard Transformers, the Kraken architecture improves Time To First Token by a geomean of 35.6% across a range of model configurations.
 
Second, the dissertation investigates the feasibility of integrating a binary classifier to increase the efficiency of the verification phase in speculative decoding. Although speculative decoding is effective in accelerating the decode step of Transformer inference, performance gains are limited to small batch sizes, where kernels are memory-bound rather than compute-bound. Using n-gram matching as the draft method and intermediate activations from early Transformer layers as input allows binary classifiers to filter out 75% of the draft tokens that would otherwise be rejected. Doing so decreases the effective batch size of the verification step, expanding the range of scenarios in which speculative decoding is effective.
 
Finally, the dissertation introduces Toggle, a dynamic optimization system that enables single-threaded CPU programs to switch both compilers and optimization choices at runtime. Relying on the premise that the best choice of compiler and optimizations is a function of the current program phase and input, the system uses otherwise idle inference accelerators and statistics from hardware performance counters to perform continuous optimization. When evaluated on the SPEC CPU 2017 benchmark suite, integrating Toggle reduced program runtime by a geomean of 4.32%, effectively extracting a year's worth of hardware performance gains.

 
Date and Time
Monday May 4, 2026 10:00am - 12:00pm
Location
Computer Science 302
Event Type
Speaker
Rohan Baskar Prabhakar
Host
Rohan Baskar Prabhakar

