04-17
Jianan Lu FPO (194 Nassau Street)

Title: Domain-Aware Data Systems for Modern Analytics

Abstract: Data systems form the critical backbone of modern analytics, enabling organizations to extract valuable insights from ever-growing volumes of data. Traditionally, analytical workloads centered on SQL-style queries over structured tables. More recently, the rise of AI and large language models has introduced a new class of semantic analytics, exemplified by vector search. As the landscape of data analytics continues to evolve, existing data systems often deliver suboptimal performance–cost tradeoffs. This dissertation introduces a domain-aware co-design approach for data systems, achieving better performance–cost tradeoffs for both structured analytics and modern semantic workloads.

The first system is Fusion, an object store for analytics that is optimized for SQL-style query pushdown on erasure-coded data. Existing pushdown solutions on disaggregated storage are inefficient because analytics files get partitioned across storage nodes during erasure coding. Instead, Fusion co-designs its erasure coding and file placement topologies, taking into account popular analytics file formats (e.g., Parquet). It employs a novel stripe construction algorithm that prevents fragmentation of computable units within an object and minimizes storage overhead during erasure coding. After generating such a pushdown-friendly data layout, Fusion adopts a fine-grained adaptive pushdown mechanism to achieve superior query performance.

The second system is Terminus, a graph-based vector search system with rank-aware early termination. Existing vector search systems waste significant I/O resources even after they have already discovered the highest-ranked (i.e., most valuable) results for downstream applications such as Retrieval-Augmented Generation (RAG). Terminus models per-I/O search utility using a rank-weighted function and terminates once recent I/Os yield negligible utility gains. By aligning I/O spending with application utility, Terminus achieves higher search throughput with minimal impact on application accuracy.

To summarize, this dissertation advances domain-aware data system design and demonstrates its effectiveness through two systems: Fusion and Terminus.

Committee Members: Mike J. Freedman (examiner), Wyatt Lloyd (examiner), Mae Milano (examiner), Jialin Ding (reader), Asaf Cidon (reader)

Date and Time
Friday, April 17, 2026, 12:00pm - 2:00pm
Not yet determined.

