Data

Lecture

Guest expert: Srravya Chandhiramowuli

Title: Making AI work: Exploring how training datasets are produced and why that matters

Abstract

AI systems have made the headlines quite frequently over the last couple of years, most often for their remarkable computational capabilities. Less well known and rarely acknowledged is the human labour involved in producing datasets for training and supporting these celebrated AI systems. Millions of workers, particularly in Global South regions, are engaged in creating large-scale annotated datasets to sustain AI's research, development and use. Yet little is known about what their work entails. What do data annotators do when they label data for AI? And how is this work organised within the AI supply chain to continually feed into, train and sustain AI systems? This lecture will address these questions by drawing on insights from an ethnographic study of data annotation work conducted in data outsourcing centres in India. We will unpack the practices, norms and frictions in dataset production, and answer an important question: why should we, as AI developers and technologists, concern ourselves with these seemingly distant realities?

Recommended papers:

(Consider these optional, in addition to the usual readings below)

Milagros Miceli, Julian Posada, and Tianling Yang. 2022. Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proc. ACM Hum.-Comput. Interact. 6, GROUP, Article 34 (January 2022), 14 pages. https://doi.org/10.1145/3492853  

Ding Wang, Shantanu Prabhat, and Nithya Sambasivan. 2022. Whose AI Dream? In search of the aspiration in data annotation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22). Association for Computing Machinery, New York, NY, USA, Article 582, 1–16. https://doi.org/10.1145/3491102.3502121 

Noopur Raval. 2021. Interrupting invisibility in a global world. Interactions 28, 4 (July - August 2021), 27–31. https://doi.org/10.1145/3469257 

Srravya Chandhiramowuli, Alex S. Taylor, Sara Heitlinger, and Ding Wang. 2024. Making Data Work Count. Proc. ACM Hum.-Comput. Interact. 8, CSCW1, Article 90 (April 2024). https://doi.org/10.1145/3637367 

Slides: 

Reading

Required - Data Justice

"What is data justice? The case for connecting digital rights and freedoms globally"

https://journals.sagepub.com/doi/full/10.1177/2053951717736335

This paper introduces some of the disparities that different people experience in the gathering or use of their data. It also proposes a solution; that part is less critical to read if you'd rather spend the time on one or more of the optional readings.

Optional - The Work Behind Data

"Justice for 'Data Janitors'"

https://www.publicbooks.org/justice-for-data-janitors/ or https://www.degruyter.com/document/doi/10.7312/marc19008-003/html

This chapter discusses the amount of hidden work done behind the scenes by people who get very little credit for enabling the power of Big Data.

If this is particularly interesting to you, check out the book "Ghost Work" by Mary L. Gray and Siddharth Suri.

Optional - The Future of Data

"Data as Property"

https://phenomenalworld.org/analysis/data-as-property

A more challenging read that discusses possible future directions for data ownership.

Optional - Data Ownership

"It’s time for a Bill of Data Rights"

https://www.technologyreview.com/2018/12/14/138615/its-time-for-a-bill-of-data-rights/

This essay argues that ownership is not only a poor way to solve problems with the use of people's data, but actually introduces new problems.

Optional - Values in Datasets

"Making Intelligence: Ethical Values in IQ and ML Benchmarks"

https://dl.acm.org/doi/pdf/10.1145/3593013.3593996

This paper explores how ethical values end up entangled in supposedly objective benchmarks, in this case in relation to the use of IQ in machine learning research.

 

License
All rights reserved The University of Edinburgh