I am a Ph.D. candidate in the Department of Computer Science at Princeton University, where I work with Prof. Jia Deng in the Princeton Vision & Learning Lab. I also collaborate closely with Prof. Olga Russakovsky. My research focuses on bridging deep learning and symbolic reasoning, with applications in automated theorem proving and mathematical reasoning in natural language. Before that, I worked in computer vision, including topics such as human pose estimation, visual relationships, and fairness.

I received my master’s degree from the University of Michigan and my bachelor’s degree from Tsinghua University.

[杨凯峪] [Email: kaiyuy@cs.princeton.edu] [CV]


  • 9/2020 Two papers accepted to NeurIPS 2020!

  • 6/2020 I was recognized as an Outstanding Reviewer by CVPR 2020!

  • 2/2020 Read about our work to improve the fairness and representation of ImageNet in Princeton Engineering News.


Strongly Incremental Constituency Parsing with Graph Neural Networks
Kaiyu Yang and Jia Deng
Neural Information Processing Systems (NeurIPS), 2020
[paper] [code] (TBA)

Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D
Ankit Goyal, Kaiyu Yang, Dawei Yang, and Jia Deng
Neural Information Processing Systems (NeurIPS), 2020, Spotlight
[paper] [code] (TBA)

Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy
Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky
Conference on Fairness, Accountability, and Transparency (FAT*), 2020
[paper] [slides] [talk] [blog] [media]

Learning to Prove Theorems via Interacting with Proof Assistants
Kaiyu Yang and Jia Deng
International Conference on Machine Learning (ICML), 2019
[paper] [code] [slides] [poster]

SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition
Kaiyu Yang, Olga Russakovsky, and Jia Deng
International Conference on Computer Vision (ICCV), 2019
[paper] [code] [poster]

Stacked Hourglass Networks for Human Pose Estimation
Alejandro Newell, Kaiyu Yang, and Jia Deng
European Conference on Computer Vision (ECCV), 2016
[paper] [code]



CoqGym: We use machine learning to automatically prove theorems, including not only theorems in math but also theorems describing the behavior of software and hardware systems. Current theorem provers usually search for proofs represented at a low level, such as first-order logic and resolution. As a result, they lack the high-level reasoning and problem-specific insights common to humans.

In contrast, we use a powerful set of tools called proof assistants (a.k.a. interactive theorem provers). These are software tools that assist human experts in proving theorems, and they provide a high-level framework that is close to human mathematical reasoning. Rather than relying on humans, we develop machine learning agents that interact with proof assistants. Our agent learns from human interactions via imitation learning, using a large amount of data available online. We use this data to construct a large-scale dataset for training and evaluating the agent. We also develop a baseline model that can prove many new theorems not provable by existing methods.



Adversarial Crowdsourcing and SpatialSense: Benchmarks in vision and language suffer from dataset bias: models can perform exceptionally well by exploiting simple cues without even looking at the image, which undermines the benchmark’s value in measuring visual reasoning abilities. We propose adversarial crowdsourcing to reduce dataset bias. Annotators are explicitly tasked with finding examples that are difficult to predict using simple cues such as 2D spatial configuration or language priors. Using this approach, we introduce SpatialSense, a challenging dataset for spatial relation recognition collected via adversarial crowdsourcing.



Fairer and More Representative ImageNet: Computer vision technology is being used by many but remains representative of only a few. People have reported misbehavior of computer vision models, including offensive prediction results and lower performance for underrepresented groups. Current computer vision models are typically developed using datasets consisting of manually annotated images or videos; the data and label distributions in these datasets are critical to the models’ behavior.

In this paper, we examine ImageNet, a large-scale ontology of images that has spurred the development of many modern computer vision methods. We consider three key factors within the person subtree of ImageNet that may lead to problematic behavior in downstream computer vision technology: (1) the stagnant concept vocabulary of WordNet, (2) the attempt at exhaustive illustration of all categories with images, and (3) the inequality of representation in the images within concepts. We seek to illuminate the root causes of these concerns and take the first steps to mitigate them constructively.



Stacked Hourglass Networks: We introduce the hourglass network: a novel convolutional network architecture for human pose estimation. It is now a standard component in many state-of-the-art methods for pose estimation.



  • During my undergraduate studies, I served as a TA for Data Structures and Algorithms at Tsinghua University, which was offered to both on-campus students and the general public as a massive open online course (MOOC). I received the Outstanding Teaching Assistant Award twice, in 2015 and 2016. Besides regular TA responsibilities such as grading and office hours, I also maintained the online infrastructure for the MOOC.