Week 2: N-gram models

Reminders and Announcements

Welcome to week 2! A big congratulations to all of you for making it through the first week! It can be a lot to keep track of everything at the start, so here are a few reminders:

  • Lab sessions this week!
    • If you are not registered for the course by Monday morning, you won't be assigned to a lab session yet. In that case, please try to attend one of the sessions with free slots (see my Learn announcement on Friday).
    • The lab is available below, under Additional Materials. Before your scheduled lab, please work through Lab 0 from last week, and also check that you can log into a DICE machine (i.e. in one of the Appleton Tower labs).
    • You don't need to find a lab partner in advance, just find a partner once you get to the lab room.
  • Response to Intake Form. Please see the bottom of this page for our responses to the intake form.
  • TA hours: A reminder that we have drop-in help hours with a TA every Friday from 10-11 am in AT 5.07, if you have a question that's not easily answered on Piazza, or just prefer to ask in person!
  • If you aren't yet registered, only recently registered, or missed a lot of Week 1:
    • Especially important to do as soon as possible: Watch the lecture 1 video and read the Course Guidance document, so you know how the course is structured.
    • Then, start working through the rest of the Week 1 materials.
    • If you run into problems with either of the above, please post to Piazza. In the meantime, keep working through other material to catch up.

Overview of Week 2

In the lectures this week, we will introduce the notion of a model and how we can train and evaluate one on data. Specifically, we will focus on n-gram models, a type of language model. Although they are nowhere near as good as today's large language models, n-gram models are also generative language models, so we'll use them to build up intuitions about how to generate text from a language model. We will also talk about how to avoid overfitting, an issue we'll see again throughout the course.
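To make the ideas above concrete before the lectures, here is a minimal sketch of a bigram (2-gram) model: it counts word pairs in a toy corpus, smooths the counts with add-k smoothing so unseen pairs don't get zero probability, and samples a sentence word by word. The corpus, the `<s>`/`</s>` boundary markers, and all function names are illustrative choices, not part of the course materials.

```python
import random
from collections import Counter, defaultdict

# Toy training corpus; the lectures use real data, of course.
corpus = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the log </s>",
    "<s> the cat saw the dog </s>",
]

# "Training" a bigram model is just counting adjacent word pairs.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

vocab = {w for s in corpus for w in s.split()}

def prob(word, prev, k=1.0):
    """Add-k smoothed estimate of P(word | prev)."""
    counts = bigram_counts[prev]
    return (counts[word] + k) / (sum(counts.values()) + k * len(vocab))

def generate(max_len=20):
    """Sample one word at a time, conditioned on the previous word."""
    word, out = "<s>", []
    for _ in range(max_len):
        words = sorted(vocab)
        weights = [prob(w, word) for w in words]
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":  # end-of-sentence marker: stop generating
            break
        out.append(word)
    return " ".join(out)

print(generate())
```

Without the smoothing term `k`, any bigram that never appears in training data would have probability zero, which is exactly the overfitting problem lecture 3 addresses.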

Lectures and reading

Lecture # | Who? | Slides                        | Reading
1         | SG   | Probability, models, and data | Probability theory tutorial
2         | SG   | N-gram models                 | JM3 3.0-3.1 (*), 3.3
3         | SG   | Smoothing and Sampling        | JM3 3.2 (*), 3.4-3.6 (*), 8.6

Additional Materials

  • Lab 1 is now available.
    • Before your scheduled lab, make sure you've completed Lab 0 from last week, and you know your DICE account/password to log into the computers in the Appleton Tower labs.
      • You should already have a DICE account if you are an Informatics MSc student; otherwise, one will be created as soon as you register for this class.
      • If you do not yet have a DICE account, please make sure to pair up during the lab session with someone who does.
    • The rest of the lab should be done during the lab session, working with a partner. To access the lab, follow the instructions at the link above.
  • Preview of next week's reading. Some students have asked if we can post each week's reading ahead of the weekly page release, so I'll start doing that here with the reading for the first two lectures of next week, which cover Ch 4:
    • For Monday: JM3 4.0-4.4 (all * except 4.3.2-4.3.3). There are some known typos:
      • Eq 4.11: y(i) should have a hat
      • Eq 4.14: missing a sigma
    • For Wednesday: JM3 4.5-4.7.0 (*), 4.7.1-4.8, 4.9 (*)
  • ILCC seminars. Some students have also asked about opportunities to learn more about NLP research. One way to do that is by attending talks in the ILCC seminar series, which hosts external speakers, many of whom work on NLP.

Response to intake form and students' worries

Thank you to everyone who completed our intake form: we received 157 responses. We now know that there are (at least) 42 different languages (and several more dialects) that you speak at home, plus another 17 you are familiar with! This means that you'll have a rich source of different language experiences to inform your discussions in tutorial groups and other interactions with each other.

Just over half of respondents have an academic background in computer science (51%), but there are significant numbers of students from linguistics, maths/physics, and engineering backgrounds. So take this opportunity to also engage with your peers who are knowledgeable in different disciplines. In other areas the class also varies a lot, with a few students who've taken NLP classes before, but also nearly a third with no previous technical experience with NLP---though almost everyone said they use generative NLP at least sometimes (with about 30% using it many times a day).

We have also looked over the concerns people expressed in the intake form. While we can't address every concern that was raised, there were a few that came up repeatedly:

  • Skills with English and/or writing: Unfortunately, given the size of this course and the many other learning objectives, we can't offer much individual help with these issues, though we will try to provide some general writing guidance in assignment handouts. If you are worried about these issues, please look over the University pages on support for English and the Institute for Academic Development, which offers resources on academic writing as well as many other topics.
  • Background knowledge/skills: Various students were worried about the aspects of the course that they are less familiar with (linguistics, programming, maths).
    • Please remember that only a handful of students in the course have all of this background, and we certainly don't expect any previous experience with NLP, so almost all of you will need to stretch your thinking in some way. We are used to getting a wide variety of students in this class, and although you will need to work hard, it is definitely possible to do well in the end even if you are rather rusty on your maths, or have little experience with programming, or if you don't know anything about linguistics. (If you don't have any of those things, though, it might be reasonable to consider whether this class is right for you.)
    • Your classmates are a great resource, which is one reason why we try to give you opportunities to meet other students, and why we pair you up for the labs and assignments. We can't guarantee that you will get a partner from a different background to yourself, but you'll have a chance to work with different partners, which increases the chances. We also encourage you to find other students to study with, and to ask them or post to Piazza when you're confused. We've already seen some good examples of students helping other students there!
  • Reading: A few students are concerned about the amount of reading, or knowing which reading is necessary. Please note that we do indicate which parts of the reading to prioritize with a *, so you know which sections are most relevant to the day's lecture.
  • Past exam papers: A few students also seem to be worried after looking at past exam papers. Please keep in mind:
    • We would not expect you to understand or be able to answer past exam questions during the first week of class!
    • However, it is true that if you are used to being able to predict very closely what questions will be on the exam by looking at previous exams, you will need to find a new strategy. We rarely re-use questions, and we often ask you to use what you've learned and apply it to new situations. We encourage you to practice these skills during the tutorial discussions.
    • Furthermore, while the format and style of questions on the exam will be similar this year, there are many topics covered in previous years that we have removed to make room for new content. So there will also be a lot of questions from past papers that you may not be able to answer even after taking this year's class, as well as new content that we've never asked about before! We will provide examples of more realistic exam questions for this year during some of the tutorials and nearer the end of the course.

 

License
All rights reserved The University of Edinburgh