FNLP: Week 3: Important ML techniques for NLP

The content of the lectures this week will introduce you to some very important algorithms and machine learning models that are useful for many NLP tasks, including learning language models. These include dynamic programming (and in particular minimum edit distance), noisy channel models, expectation maximisation (EM) and Naive Bayes.

We'll continue to talk about language models, and in particular you will learn about more approaches to smoothing, which, as you learned last week, is a technique for tackling the sparse data problem: the fact that at test time the system will see data that was entirely absent from the training data. This week we'll focus on smoothing techniques that capture the fact that not all unseen data should be assigned the same likelihood.

We'll then return to spelling correction, one of the NLP applications that uses a language model as a component, and talk about how a spelling correction system can be acquired from data. This application provides a good opportunity to introduce noisy channel models: for spelling correction, the noisy channel model has two components, a language model and an error model. Here, the error model is a probability distribution P(c|c') over the character c that gets typed (potentially by mistake!), given the character c' that the author intended to write. We'll discuss the various ways in which you can learn an error model from data, which gives us a chance to introduce edit distance and expectation maximisation. Together, they provide a way of learning an error model from a corpus of non-edited documents paired with their edited versions, even though the mapping from character(s) in the edited version to those in the non-edited version is missing, or latent.
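To make the idea of edit distance concrete, here is a minimal sketch of minimum edit distance computed by dynamic programming. It assumes a cost of 1 for each insertion, deletion and substitution (other cost schemes are possible, and the lectures may use a different one); the function name min_edit_distance is chosen here purely for illustration.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target.

    Assumption: insertion, deletion and substitution each cost 1.
    """
    n, m = len(source), len(target)

    # dist[i][j] holds the cost of turning source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    # Turning a prefix into the empty string takes i deletions.
    for i in range(1, n + 1):
        dist[i][0] = i
    # Turning the empty string into a prefix takes j insertions.
    for j in range(1, m + 1):
        dist[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching characters can be copied at no cost.
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # delete source[i-1]
                dist[i][j - 1] + 1,             # insert target[j-1]
                dist[i - 1][j - 1] + sub_cost,  # substitute or copy
            )

    return dist[n][m]


if __name__ == "__main__":
    print(min_edit_distance("acress", "actress"))
```

Each cell of the table is filled in once from three neighbouring cells, so the running time is proportional to the product of the two string lengths. Keeping track of which neighbour gave the minimum would also recover a character alignment, which is the kind of mapping that is latent in the paired corpus described above.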

Finally, we'll talk about Naive Bayes, an alternative to the noisy channel model for learning from data. It is a very simple model, but surprisingly effective at tackling various NLP tasks.

The content in this folder is structured as follows:

7: More smoothing and the Noisy Channel Model

8: Spelling Correction, edit distance and EM

9: Naive Bayes

As always, each of the above includes videos, the slides that were used in the videos, required readings, and a post-lecture quiz. The quiz is a chance for you to gauge your understanding of the material presented here, and so we strongly encourage you to review this content in the above order, and then complete the quiz. If there is anything you don't understand, then you have several options:

Post a question on Piazza;
Ask a question at the in-person lectures; and/or
Ask your tutor.
