Rohan Baskar Prabhakar will present his FPO "Hardware-Aware Software Optimizations for and with Machine Learning" on Monday, May 4, 2026 at 10am in CS 302.

 
His committee is as follows:
Examiners: David Wentzlaff (adviser), Aarti Gupta, and Ravi Netravali
Readers: Jialin Ding and Prateek Mittal
 
Abstract:
In recent years, fundamental limits in semiconductor manufacturing have caused a gradual decline in the steady cadence of hardware performance scaling. With pivotal trends like Moore's law and Dennard scaling ending, there is a growing need to ensure that software workloads execute at peak efficiency. Towards this objective, this dissertation describes three hardware-aware optimizations that either accelerate machine learning inference or use machine learning to optimize single-threaded CPU workloads.
 
First, this dissertation presents Kraken, a variation of the Transformer architecture designed to improve the efficiency of tensor parallelism during inference. The new model architecture incorporates an innate notion of model parallelism that complements the topology of multi-device inference hardware and allows communication to overlap with compute. Experiments demonstrate that, while preserving the language modeling performance of standard Transformers, the Kraken architecture improves Time To First Token by a geomean of 35.6% across a range of model configurations.
 
Second, the dissertation investigates the feasibility of integrating a binary classifier to increase the efficiency of the verification phase in speculative decoding. Although speculative decoding is effective in accelerating the decode step of Transformer inference, performance gains are limited to small batch sizes, where kernels are memory-bound rather than compute-bound. Using n-gram matching as the draft method and intermediate activations from early Transformer layers as input allows binary classifiers to filter out 75% of the draft tokens that would otherwise be rejected. Doing so decreases the effective batch size of the verification step, expanding the range of scenarios in which speculative decoding is effective.
 
Finally, the dissertation introduces Toggle, a dynamic optimization system that enables single-threaded CPU programs to switch both compilers and optimization choices at runtime. Relying on the premise that the best choice of compiler and optimizations is a function of the current program phase and input, the system uses otherwise idle inference accelerators and statistics from hardware performance counters to perform continuous optimization. When evaluated on the SPEC CPU 2017 benchmark suite, integrating Toggle reduced program runtime by a geomean of 4.32%, effectively extracting a year's worth of hardware performance gains.

 
Date and Time
Monday May 4, 2026 10:00am - 12:00pm
Location
Computer Science 302
Event Type
Speaker
Rohan Baskar Prabhakar
Host
Rohan Baskar Prabhakar

