FNLP: Week 2: Annotation, Evaluation and Language Models

Last week, we discussed how NLP must meet the challenge of resolving ambiguity, and how machine learning applied to corpora can help to meet that challenge. This week, we will say more about corpora and how to annotate them with the kind of (latent) information that you want your NLP model to acquire via machine learning methods. We will also talk about various metrics that you can use to evaluate how well your NLP system is performing. We will then focus on a particular task in NLP: language modelling. A language model is a probability distribution over word sequences: it tells you how likely, or unlikely, a given sequence of words is to occur. Language models form a part of many NLP applications. We are going to look in detail at how one can acquire a language model from corpus data, because doing so lets us introduce challenges that are common to all machine learning approaches to building NLP systems, in particular the sparse data problem.
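To make the idea of a probability distribution over word sequences concrete, here is a minimal sketch, assuming a bigram model estimated by simple relative frequencies. The toy corpus, the function names and the `<s>`/`</s>` boundary markers are illustrative assumptions, not material from the lectures.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Pad with hypothetical sentence-boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sequence_probability(sentence, context_counts, bigram_counts):
    """Multiply relative-frequency estimates of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # the context was never observed in the corpus
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

# Toy corpus, invented purely for illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
context_counts, bigram_counts = train_bigram_counts(toy_corpus)

seen = sequence_probability("the cat sat", context_counts, bigram_counts)
unseen = sequence_probability("the dog ran", context_counts, bigram_counts)
print(seen, unseen)
```

Under these assumptions, "the dog ran" receives probability zero simply because the pair "dog ran" never occurs in the toy corpus, even though it is a perfectly plausible sentence. This is a small instance of the sparse data problem.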

The content in this folder is structured as follows:

4: Methods in Annotation and Evaluation

5: N-gram Language Models

6: Evaluation and Smoothing

As always, each of the above includes videos, the slides that were used in the videos, required readings, and a post-lecture quiz. The quiz is a chance for you to gauge your understanding of the material presented here, and so we strongly encourage you to review this content in the above order, and then complete the quiz. If there is anything you don't understand, then you have several options:

Post a question on Piazza;
Ask a question at the in-person lectures; and/or
Ask your tutor.
