COS484: Natural Language Processing

COS484 Final Project Guidelines

COS 597E, Princeton
6.806, MIT
CS 224N, Stanford

1. Domain adaptation of pre-trained word embeddings: Methods like ELMo and BERT are trained on large amounts of web text. However, these representations do not work as well for specialized domains such as medical text. One solution is to re-train the representations on medical text (e.g. PubMed), but this is computationally expensive and requires a lot of data. Can we build methods that adapt existing pre-trained representations to new domains like medical text using only a small amount of data? Performance can be tested on tasks from the BLUE benchmark.

2. Combine edit-based language generation models (Guu et al., Hashimoto et al.) with character-level language models (Kim). The idea is to first generate a sentence using the character-based model and then iteratively refine it using character-level edits. Potential advantages include faster inference (since the output layer is smaller), fewer parameters, and greater use of orthographic structure, which is shared between the input and output layers.

3. Energy-based models (EBMs) allow us to perform search at inference time, potentially allowing for different types of reasoning. Can we learn energy-based models over characters, words and sentences and use them for tasks like spelling correction or sentence editing by performing gradient descent over the energy landscape?

4. The SemEval 2020 competition has several subtasks. You are free to pick any one that interests you. As a bonus, you could even participate in the official competition if your model performs well!
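
To make the "gradient descent over the energy landscape" in idea 3 a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration, not a prescribed approach: the three-word vocabulary, the 2-D embedding values, the SIGMA constant, and the function names are all hypothetical, and the energy is a hand-written mixture (negative log of a sum of Gaussians centred on the word embeddings) rather than a learned model. A noisy input vector is moved downhill until it settles near a vocabulary entry, which is then read off as the "corrected" word.

```python
import math

# Hypothetical 2-D "embeddings" for a toy vocabulary (made-up values).
VOCAB = {
    "cat": (1.0, 0.0),
    "car": (0.0, 1.0),
    "cut": (-1.0, -1.0),
}
SIGMA = 0.5  # assumed width of each energy well


def _logits(x):
    # Negative scaled squared distance from x to every word embedding.
    return [
        -sum((xi - ei) ** 2 for xi, ei in zip(x, e)) / (2 * SIGMA ** 2)
        for e in VOCAB.values()
    ]


def energy(x):
    # E(x) = -log sum_w exp(-||x - e_w||^2 / (2 sigma^2)), computed stably.
    logits = _logits(x)
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))


def grad(x):
    # dE/dx = sum_w p_w * (x - e_w) / sigma^2, with p = softmax(logits).
    logits = _logits(x)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    g = [0.0] * len(x)
    for weight, e in zip(exps, VOCAB.values()):
        for i in range(len(x)):
            g[i] += (weight / z) * (x[i] - e[i]) / SIGMA ** 2
    return g


def descend(x, step=0.1, iters=200):
    # Plain gradient descent on the energy, starting from x.
    x = list(x)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return tuple(x)


def decode(x):
    # Read off the vocabulary word whose embedding is closest to x.
    return min(
        VOCAB,
        key=lambda w: sum((xi - ei) ** 2 for xi, ei in zip(x, VOCAB[w])),
    )


if __name__ == "__main__":
    noisy = (0.8, 0.3)  # a perturbed version of the "cat" embedding
    refined = descend(noisy)
    print(decode(refined), energy(noisy), energy(refined))
```

In an actual project the energy function would be a trained network over characters, words or sentences, and handling discrete text (for example through a continuous relaxation, as assumed here) is part of the research question.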