Week 6: Attention and Transformers
Reminders and announcements
Welcome to Week 6! Our course is now half over, and while you may feel overwhelmed by all the semester activities, you have already made significant progress. In fact, we have covered more than half of the examinable material at this point. (To give you time to work on the assignment, weeks 8-10 will have fewer examinable lectures.)
- Solutions to tutorial 2 are now available. 
- Lab 3 group meetings take place this week. In this lab, we will walk through the implementation, training, and evaluation of a sequence-to-sequence Recurrent Neural Network (RNN). We will focus on evaluating the model on the task of mapping a sequence of instructions to a sequence of actions, based on the SCAN dataset. This will also give us an opportunity to discuss important aspects of learning, such as compositional and out-of-distribution generalisation. 
- You may want to start thinking about your partner for the assignment. More information about this is at the bottom of the page.
- Preview of next week's reading:
  - Masked language models: 10.0-10.2.1, 10.3-10.4
  - Large language models: 7.1-7.2
 
Overview of this week
The goal of this week is to introduce the attention layer, which was originally developed for sequence-to-sequence RNNs and later became a core building block of Transformers. We will motivate attention from both linguistic and optimisation viewpoints, present several variants (e.g., additive and multiplicative attention), and explain the purpose of having multiple "attention heads" within the same attention mechanism. We will then show how attention becomes a key module of Transformer layers when combined with feed-forward layers, highway connections, layer normalisation, and positional encoding. Finally, we will cover the complexity and interpretability of attention.
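As a concrete illustration of the scoring variants mentioned above, here is a minimal NumPy sketch contrasting additive (Bahdanau-style) attention with scaled dot-product ("multiplicative") attention for a single query. This is not the lecture or lab reference code; the parameter names (W_q, W_k, v) and toy dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(query, keys, values, W_q, W_k, v):
    """Additive (Bahdanau-style) attention for a single query vector.

    query: (d_q,)   keys: (T, d_k)   values: (T, d_v)
    W_q: (d_a, d_q)   W_k: (d_a, d_k)   v: (d_a,)   -- illustrative parameter names
    """
    # Score each position with a small feed-forward network over query and key.
    scores = np.tanh(W_q @ query + keys @ W_k.T) @ v    # (T,)
    weights = softmax(scores)                            # distribution over positions
    return weights @ values                              # weighted sum of values, (d_v,)

def scaled_dot_product_attention(query, keys, values):
    """Multiplicative (dot-product) attention, scaled by sqrt(d_k) as in Transformers."""
    d_k = keys.shape[-1]
    scores = keys @ query / np.sqrt(d_k)                 # (T,)
    weights = softmax(scores)
    return weights @ values                              # (d_v,)

# Toy usage with random tensors (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
T, d = 5, 8
q = rng.normal(size=d)
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
W_q, W_k, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
print(additive_attention(q, K, V, W_q, W_k, v).shape)    # (8,)
print(scaled_dot_product_attention(q, K, V).shape)       # (8,)
```

Multi-head attention runs several such scaled dot-product computations in parallel over different learned projections of the queries, keys, and values, then concatenates the head outputs; we will see in the lectures why this helps.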
Lectures and readings
| Lecture # | Who? | Slides | Reading | 
|---|---|---|---|
| 1 | EP | Attention in sequence-to-sequence models | JM3 13.8 (*), 12.0-12.3 (*) | 
| 2 | EP | Self-attention in Transformers | JM3 8.1 (*), 8.3 (*) | 
| 3 | EP | Transformer architecture | JM3 8.2 (*), 8.8.2 (*) | 
Assignment partnering
- At the end of week 7, we will release the specification of the ANLP assignment, which is worth 30% of the final mark for our course. Students are encouraged to work on the assignment in pairs. You have three options: 1) choose your own partner; 2) ask us to find you a partner; 3) decline to be paired with a partner and work on your own.
- Please indicate your preference on the ANLP 2025 partnering form before Monday, 27th October at noon.
- If you have found your own partner, one of you should fill out the form with your partner's details.
- We will only assign you a partner if you fill in the form. If you want to work alone, you don't need to fill in the form. However, previous experience suggests that pairs tend to do better on the assignment than individuals.