Carlos Jimenez will present his FPO "Agent in the Shell: Establishing Agentic Software Engineering" on Tuesday, April 28, 2026 at 3:30 in Friend 110.
The members of Carlos’ committee are as follows:
Examiners: Karthik Narasimhan (Adviser), Danqi Chen, Sanjeev Arora
Readers: Karthik Narasimhan, H. Sebastian Seung
A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu if you would like one.
Everyone is invited to attend the talk.
Abstract follows below:
Neural language models were put to work on code almost as soon as they were useful for anything. Their capabilities have grown from auto-complete to chat interactions, and we establish a third paradigm, in which language models work as agents that autonomously complete full software engineering tasks. The first benchmarks for code language models focused on isolated function synthesis: a specification goes in, a function comes out. Real software engineering is a different activity; it requires navigating directories, running code, and editing files that already exist inside a larger system. We propose an evaluation task that measures progress using precise execution verification and demonstrate the benefits of the agentic approach, in which the model works on the repository interactively.
We first introduce SWE-bench, a benchmark drawn from real GitHub issues across twelve open-source Python repositories, in which each task is scored by whether a system's patch passes the repository's own tests. The strongest retrieval-augmented baselines resolve under 2% of the test set, establishing that single-shot generation is not enough.
We then present SWE-agent, which gives a language model interactive access to the repository through the shell via an interface specially designed for software engineering agents. With model weights fixed, SWE-agent resolves 12.47% of SWE-bench, and ablations isolate the interface as an axis of progress separable from model scale.
Finally, we introduce SWE-bench Multimodal, which reapplies the collection pipeline to visual JavaScript repositories and shows that the task definition transfers across language and modality, while systems tuned to Python largely do not.
In the years since its release, SWE-bench has been widely adopted as a core evaluation target for frontier models, and reported resolve rates have risen above 90% on a human-validated subset as of April 2026. We analyze the work that has followed and find that model improvement has far outpaced scaffold design: a minimal bash-only agent tracks the frontier within a few points, reflecting both raw capability gains and the tailoring of frontier models toward agentic software engineering as an explicit training target.