James Zhang wins Senior Thesis Prize from the Center for Digital Humanities

July 21, 2025
James Zhang, a computer science major, presents his work at a conference hosted by the Center for Digital Humanities at Princeton University in May 2025. Photo by Carrie Ruddick

By the Princeton Center for Digital Humanities

Have a lot of character flaws — in your scanned documents, that is? James Zhang, winner of the Center for Digital Humanities Senior Thesis Prize, can help.

The computer science major from Basking Ridge, NJ, began doing research in the digital humanities as a junior when he took an independent work seminar led by Brian Kernighan, the William O. Baker *39 Professor in Computer Science and a member of the Center for Digital Humanities Executive Committee. In that class, Zhang started exploring the limitations of Optical Character Recognition (OCR), a technology that converts images of words, like a scanned document or a photo of pages in a book, into machine-readable text.

Zhang's junior project used large language models to reconstruct the original text from flawed OCR output, using the Princeton Prosody Archive (PPA), a database of thousands of digitized books in English published between 1532 and 1928. Building on this work for his senior thesis, Zhang expanded his approach by incorporating large vision-language models to analyze the visual elements of historical documents. The result is MetaScribe, an open-source tool he developed to help librarians and researchers get more from archives like the PPA.

His innovative work earned him several awards, including the CDH Senior Thesis Prize, the Computer Science Department Outstanding Senior Thesis Award, the School of Engineering Calvin Dodd McCracken Senior Thesis/Project Award and the Princeton University Library Award at the 2025 Princeton Research Day. 

In the interview below, the CDH asked Zhang about his thesis research and what draws him to projects in the digital humanities. Read the original story on the CDH website.

Tell us a little about your thesis. What problem did you identify, and how did you work toward a solution?

My thesis explores how large vision-language models (LVLMs) can support Optical Document Recognition (ODR) in libraries; specifically, how they can help extract and generate richer metadata from scanned documents. I built an open-source tool called MetaScribe that leverages LVLMs to extract information from scanned documents and create metadata based on customizable fields. It is designed to scaffold the metadata creation process, offering a flexible starting point that archives and libraries can adapt to their own needs.
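To make "metadata based on customizable fields" concrete, here is a minimal sketch of how such a workflow might look. It is not MetaScribe's actual code or API: the `Field` class, `build_prompt`, `parse_response`, the field names, and the stand-in model reply are all hypothetical, and the call to a vision-language model is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One user-defined metadata field (hypothetical structure)."""
    name: str
    description: str


def build_prompt(fields):
    """Turn the chosen fields into instructions to send along with a page image."""
    lines = ["Look at the attached page image and return a JSON object with these keys:"]
    lines += [f'- "{f.name}": {f.description}' for f in fields]
    lines.append("Use null for any field that cannot be determined from the page.")
    return "\n".join(lines)


def parse_response(raw, fields):
    """Keep only the requested keys; keys missing from the reply become None."""
    data = json.loads(raw)
    return {f.name: data.get(f.name) for f in fields}


# Example fields an archive might choose (illustrative only).
fields = [
    Field("title", "the work's title as printed on the page"),
    Field("has_illustration", "true if the page contains an illustration"),
]
prompt = build_prompt(fields)

# Made-up stand-in for a model reply; a real LVLM call would go here.
fake_reply = '{"title": "Example Title", "has_illustration": false, "extra": 1}'
record = parse_response(fake_reply, fields)
print(record)
```

Keeping the field list as plain data is one way a tool could let each archive swap in its own schema without touching the rest of the pipeline.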

Zhang and his thesis adviser, Brian Kernighan, at the Department's Class Day celebration. Photo courtesy of James Zhang

How were you first introduced to the Center for Digital Humanities?

I took Professor Brian Kernighan’s independent work seminar on digital humanities, where Wouter Haverals, Perkins Fellow and Associate Research Scholar at the CDH, and Mary Naydan, Digital Humanities Project Manager, served as teaching assistants. Through them, I was introduced to the Princeton Prosody Archive and learned about the broader challenges the CDH faced. 

Wouter and Mary first pointed out that the Optical Character Recognition provided by vendors like Gale and HathiTrust was often subpar, which limited the archive’s usability. That semester, we tackled the post-OCR correction problem, asking: If all we had was the raw OCR text, could we reconstruct the correct transcription? We found that purely textual large language models could, surprisingly, perform this task quite well.

But through this process, we started questioning the framing: Was “recognition” really the main problem? As we browsed the PPA, with its visually rich and structurally complex pages, it became clear that even perfect OCR often flattens the very elements that give documents meaning. What we needed wasn’t just more accurate transcriptions: it was better representation.

That tension, between treating documents as plain text versus as visual objects imbued with semantic structure, became the heart of my thesis. Playing with the PPA’s search interface made the issue feel concrete: there was all this visual richness, but little way to search for it. It created what felt to us like an “illusion of find.”

And so with the rise of LVLMs, new possibilities opened up. Their general capabilities, from layout and chart understanding to visual-text interpretation, map naturally onto tasks libraries care about, particularly metadata creation, which remains a labor-intensive human process. We saw that LVLMs could offer a way to bridge the gap between raw document images and meaningful, searchable representations. With a richer metadata backbone, we can then enable more powerful discovery and computational scholarship.

What challenges did you overcome?

At many moments over the year, it was tempting to jump straight into building something based on my own assumptions about what the PPA (and libraries and archives more broadly) need, despite having no prior experience with the PPA or the metadata creation process. One of the greatest challenges was resisting this urge.

To keep my assumptions to a minimum, I spoke directly with staff across the CDH and the University Library, including Special Collections, Digital Scholarship, and Research Data, all of whom shaped how I thought about metadata, digitization, and what counts as “useful” information. Through their iterative feedback, I identified a few key themes that ultimately shaped my thesis’s system design.

What was one exciting takeaway from your project?

Out-of-the-box LVLMs already show promising performance, making it possible for archivists and librarians to start experimenting with their own collections — and even generate useful metadata right away.

Cost, speed, and performance trade-offs matter at scale. For example, the most accurate model I tested (OpenAI’s o1) wasn’t always the best choice in practice. In fact, a smaller and cheaper model (Google’s Gemini-2.0-Flash) proved ideal for many tasks, running 20 times faster and costing 680 times less per page than o1, while still delivering strong results. This suggests that even institutions with limited resources can leverage AI tools today, not just those with large budgets.
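A back-of-the-envelope sketch shows how those two ratios compound over a large collection. Only the 20x speed and 680x cost ratios come from the interview; the baseline per-page cost and time, the collection size, and the names `ModelProfile` and `collection_totals` are hypothetical placeholders, not measurements from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_page: float     # US dollars per page (placeholder value)
    seconds_per_page: float  # seconds per page (placeholder value)


def collection_totals(model, pages):
    """Return (total cost in dollars, total hours) for processing pages one at a time."""
    total_cost = model.cost_per_page * pages
    total_hours = model.seconds_per_page * pages / 3600
    return total_cost, total_hours


# Placeholder baseline for the larger model; NOT figures from the thesis.
large = ModelProfile("larger model", cost_per_page=0.68, seconds_per_page=40.0)

# The smaller model is derived from the two ratios quoted in the interview.
small = ModelProfile(
    "smaller model",
    cost_per_page=large.cost_per_page / 680,
    seconds_per_page=large.seconds_per_page / 20,
)

pages = 100_000  # hypothetical collection size
for model in (large, small):
    cost, hours = collection_totals(model, pages)
    print(f"{model.name}: ${cost:,.2f} and {hours:,.1f} hours for {pages:,} pages")
```

Whatever the true baseline is, the ratios carry through unchanged, so the gap in absolute dollars and hours grows linearly with the number of pages.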

What inspired you to continue your work in the digital humanities for your senior thesis?

Since learning about the Center last year, I’ve been continually inspired by the wonderful projects the CDH supports. Digital humanities, as a field, brings together a unique blend of technical innovation and deep engagement with human history and culture.

What pushed me in particular was seeing how genuinely interested the staff at the CDH and PUL were in my research, even when it was just a proposal. Over time, I recognized that my work could actually support researchers and librarians in a meaningful, lasting way. That sense of purpose has carried me since.

What was your experience as a computer science student working with humanists? How did your involvement with the CDH complement your education in computer science, or how did it challenge it?

In many computer science classes, there’s usually a right answer. That is not quite so in the digital humanities. Working with humanists at the CDH opened my eyes to the sheer messiness and ambiguity of real-world problems. I learned to make decisions based on practical tradeoffs, which pushed me to become more adaptable, thoughtful and collaborative in my approach to problem-solving.

We know you won a Schwarzman Scholarship and plan to study global affairs at Tsinghua University. Congratulations! How do you see your work at the CDH connecting to your graduate study and future career?

My work at the CDH taught me the value of admitting what I don’t know and deferring to the expertise of others. I learned to ask questions first, build second. As I move into studying global affairs in the context of ensuring safe AI development at Tsinghua, I’ll carry that same humility and collaborative spirit to build bridges across fields and cultures.

What’s your advice to other computer science students interested in the intersection between CS and the humanities?

Channel an open mind! Resist the urge to assume you know what humanists need. Don’t be afraid to aggressively chase what you don’t know. Continually immerse yourself in the inevitably messy (and hopefully unfamiliar) data and listen more than you talk.