Week 3: N-gram Language Models
Reminders and Announcements
Welcome to the start of our third week! We are still seeing a good amount of activity on Piazza (thanks especially to those of you who have answered other students' questions!), and it's great to see you all in the live sessions and labs. I hope you are all getting to know some other students in the class by now and, if you are new to Edinburgh, also getting to know the city a little bit. Here are some reminders for this week:
- Week 3 lab. This week we have our second lab dedicated to working with probability distributions [pdf].
- Change of time for help hours. Since we realised that the previous time clashed with a mandatory activity, we have now found a new time slot: starting this week, help hours will take place on Tuesdays between 12 and 1 pm in room AT 5.07.
Moreover, we have scheduled extra help hours over the next two weeks to assist you with the assignment. These will also take place in the usual room (AT 5.07).
Day | Time |
---|---|
Mon 7th Oct | 10am |
Tue 8th Oct | noon |
Thu 10th Oct | 3pm |
Fri 11th Oct | 10am |
Mon 14th Oct | 10am |
Tue 15th Oct | noon |

- Assignment 1. This week we will also release the first assignment (or coursework) for ANLP. Please take note that:
- The instructions for the assignments will become available here on Wednesday [pdf].
- Around the same time, we will also announce the pairings (note: updated on 5/10/24). You can find your group for the assignment here [csv] (including students working alone). Please contact your partner (if any) as soon as possible to start working together on the assignment. To do so, you can simply email UUN@sms.ed.ac.uk (your partner's UUN is listed in both the .txt and .csv files).
- After the assignment has been released, please write your answers on the submission template [tex] [docx], compile a PDF, and upload it to Gradescope by October 17th at noon.
- Next week's tutorial (discussion) questions. These are the questions that will be discussed in the first tutorial group meetings next week (Week 4). You'll be expected to participate in the discussion, so please review all the questions in advance.
- Week 2 lab solutions are now available here.
Overview of Week 3
This week, we will introduce language models, i.e., probability distributions over finite-length strings. I will provide both an intuitive and a formal definition. We will first survey how to evaluate language models and how to sample sentences from them. Next, I will describe a specific count-based approach to estimating language models, namely n-gram models. Finally, to regularise these models and avoid overfitting, we will cover a series of advanced smoothing techniques.
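To make these ideas concrete, here is a minimal sketch of a count-based bigram model in Python. It assumes add-alpha smoothing, which is only one simple option and not necessarily one of the advanced techniques covered in the lectures. The function names (`train_bigram`, `prob`, `perplexity`, `sample`) and the boundary symbols `<s>` and `</s>` are illustrative choices, not taken from the course materials.

```python
import math
import random
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols

def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenised sentences."""
    bigrams, contexts = Counter(), Counter()
    vocab = {EOS}  # EOS can be predicted; BOS only ever appears as context
    for sent in sentences:
        tokens = [BOS] + sent + [EOS]
        vocab.update(sent)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, sorted(vocab)

def prob(cur, prev, model, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev)."""
    bigrams, contexts, vocab = model
    return (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * len(vocab))

def perplexity(sentences, model, alpha=1.0):
    """Per-token perplexity (EOS counts as a token; words must be in the vocabulary)."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = [BOS] + sent + [EOS]
        for prev, cur in zip(tokens, tokens[1:]):
            log_prob += math.log2(prob(cur, prev, model, alpha))
            n += 1
    return 2 ** (-log_prob / n)

def sample(model, alpha=1.0, max_len=20, rng=random):
    """Sample words one at a time until EOS is drawn.

    max_len is a safety cap, so very long samples are truncated.
    """
    _, _, vocab = model
    prev, out = BOS, []
    while len(out) < max_len:
        weights = [prob(w, prev, model, alpha) for w in vocab]
        cur = rng.choices(vocab, weights=weights)[0]
        if cur == EOS:
            break
        out.append(cur)
        prev = cur
    return out

if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    model = train_bigram(corpus)
    print(prob("cat", "the", model))               # smoothed P(cat | the)
    print(perplexity([["the", "cat", "sat"]], model))
    print(sample(model, rng=random.Random(0)))
```

Because the smoothed probabilities for each context sum to one over the vocabulary, a sentence made of known words arranged in unseen bigrams (for example "the sat") still receives non-zero probability, which is what keeps perplexity finite on held-out text.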
Lectures and Readings
Lecture # | Who? | Slides | Reading |
---|---|---|---|
1 | EP | Language models | JM3 2.5, JM3 3.2-3.4; optional: Formal Aspects of Language Modeling §2.0-2.4 |
2 | EP | N-gram LMs | JM3 3.0-3.1 |
3 | EP | Advanced Smoothing | JM3 3.5-3.6 |
Errata:
- I changed the notation of the vocabulary set from Σ to 𝒱 to avoid confusion with the symbol for summation.
- I amended the formula for corrected counts in W3 L3 (slide 9) which had a typo.
Tutorial (discussion) groups
By this time, you should have received your group assignment for the ANLP tutorials, and it should show up in your timetable. If you have a regular conflict with the time of your tutorial (or lab) group, you may request a switch using this link.
The first group meetings will take place next week but will discuss questions related to earlier weeks' content. Tutorial groups will be based around the discussion of open-ended questions. This is in contrast to lecture quizzes, which cover technical questions that have a single answer and can be checked automatically. Tutorials and quizzes reflect the two kinds of questions you will encounter in the final exam, so it is important to engage with them to prepare yourself properly. If you have questions about the more technical content, please continue to make use of Piazza, the live lecture session, and/or drop-in help hours.