Week 1: Language as data (structure and statistics)

What to expect in each week's content page

Welcome to Week 1! From now on, each week you'll see a new content page structured very much like this one. The usual structure will be:

  • reminders and announcements (as needed);
  • a brief overview of the week's content;
  • a section listing the lecture topics and readings. We will aim to add links to the lecture slides no later than the evening before each lecture, and earlier if possible. However, the course is undergoing major updates this year, so please bear with us if we are occasionally later than hoped, or need to post updates afterward.
  • a section listing any additional materials, such as lab materials or other exercises you will need to do.

Overview of Week 1

In lectures this week, we will start by introducing ourselves and giving an overview of how this course is structured and how to get the most out of it. We'll also discuss some of what makes language different from other types of data: ambiguity, structure, and statistical properties. We'll go into more detail on some of these issues in lectures 2 and 3, where we focus on words from both a linguistic and a statistical perspective. A common feature of words across many languages is that they are built from smaller pieces, called morphemes. Looking at morphemes will help us understand some of the ways that languages can differ from each other, as well as some of the similarities. Finally, we will talk about a different way to split words into smaller units, using a purely data-driven approach.
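To give a feel for what a "purely data-driven" way of splitting words can look like, here is a minimal sketch of byte-pair encoding (BPE). This page does not name the method covered in lecture 3, so treating it as BPE is an assumption. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy word counts are made up for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the alphabetically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + count
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus statistics (invented for this example).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=4)
print(toy_merges)
print(segment("lowest", toy_merges))
```

Note that the units found this way come only from frequency counts, so they need not line up with morphemes, which is one reason the two views of word structure are worth comparing.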

Lectures and reading

Lecture # | Who? | Slides | Reading
1 | SG | Introduction | Course Guidance; Week 0: Preparation steps; Mathematics tutorials (see note below)
2 | SG | Words as data: corpora, statistics, and word classes | JM3 2.0-2.1 (*), 2.6 (*), 2.7, 17.1 (see ANLP: Readings for info)
3 | SG | Morphology and sub-word tokenization | JM3 2.2 (*), 2.3, 2.4 (*)

Note on mathematics tutorials: These are required for students who are not already familiar with the concepts they cover. The course material from Week 2 onward assumes you're familiar with these concepts, especially Probability Theory. You should get started on the tutorials as soon as possible, and work through them by the end of this week.

Additional materials

To learn successfully from this course, you will need to work on it actively. Each week we will provide additional materials, such as labs and tutorial exercises, to help you build up your understanding: from remembering basic facts, to applying methods, to synthesizing different ideas and considering the strengths and weaknesses of different approaches. Your answers to these problems do not affect your course mark, but it will be very difficult to do well in the course without working through them yourself (or together with other students) before looking at any solutions.

While it may be tempting to check solutions before working through the problems yourself, that is not a good way to learn the material you need to know for the exam; it is too easy to fool yourself into believing that you would have gotten the right answer. Please work through the exercises and ask questions if you get stuck, or if you are still confused even after seeing the solutions. You can ask your fellow students or on the Piazza discussion forum; go to the weekly TA help hour; or ask during timetabled labs and tutorials. (See this page for details of where to get help.)

  • ANLP 25-26 intake form (to complete by Thursday evening; should take less than 5 minutes).
    • Please complete the form even if you’re still deciding whether to take the course. But please don’t complete it if you are just planning to audit (register class-only).
    • The form will ask a few quick questions about your background. We use this information to get a better sense of the range of students in the class and for course planning and management.
  • Lab 0: Introduction to Noteable (to complete by the end of this week; should take 1-2 hours depending on your previous experience).
    • We will be using Noteable for all of the labs in this course. Please work through this tutorial on your own time to familiarize yourself with Noteable, so that you'll be ready to get started next week with the labs focusing on course content.
    • If you're already familiar with Jupyter notebooks, you'll be able to skim/skip some parts of the lab, but please see the introduction for which parts you should still work through.
    • Starting next week with Lab 1, all remaining labs will be timetabled. During these lab sessions, we ask you to work with a partner (not assigned in advance; you can find someone when you get to the lab!), and demonstrators will be available to help if you have questions.
License
All rights reserved The University of Edinburgh