Last week, we talked about how NLP must meet the challenge of resolving ambiguity, and how machine learning applied to corpora can help meet that challenge. This week, we will say more about corpora and how one annotates them with the kind of (latent) information that you want your NLP model to acquire via machine learning methods. We will also talk about various metrics you can use to evaluate how well your NLP system is performing. We will then focus on a particular NLP task: language modelling. A language model is a probability distribution over word sequences: it tells you how likely, or unlikely, a given sequence of words is to occur. Language models form a part of many NLP applications. We are going to focus in detail on how one can acquire a language model from corpus data, because doing so provides a forum for introducing challenges that are common to all machine learning approaches to building NLP systems, in particular the sparse data problem.
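To make the idea of a language model concrete, here is a minimal sketch (not the course's implementation) of a bigram model estimated from raw counts. The toy corpus and the function name `prob` are illustrative assumptions; note how any unseen word pair gets probability zero, which is exactly the sparse data problem mentioned above.

```python
from collections import Counter

# Toy corpus (an illustrative assumption, not course data).
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    # Pad with sentence-boundary markers so edge words get probabilities too.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def prob(sequence):
    """P(sequence) as a product of unsmoothed bigram probabilities."""
    tokens = ["<s>"] + sequence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        # Maximum-likelihood estimate: count(prev, word) / count(prev).
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

print(prob("the cat sat"))     # a seen sequence: nonzero probability
print(prob("the cat barked"))  # contains an unseen bigram: probability 0.0
```

The zero probabilities for unseen bigrams are what smoothing methods (covered in section 6) are designed to repair.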
The content in this folder is structured as follows:
4: Methods in Annotation and Evaluation
5: N-gram Language Models
6: Evaluation and Smoothing
If you have any questions about this material, you can:
- Post a question on Piazza;
- Ask a question at the in-person lectures; and/or
- Ask your tutor.