INF2-FDS: Revision

Revision resources and advice

Here are resources for revision:

  • Comprehension questions on Learn.
  • Mock test on Learn in Revision folder.
  • The workshop sheets and solutions from S2 Weeks 4 and 6, which have exercises on sampling, confidence intervals, logistic regression, A/B testing and hypothesis testing. (A small worked sketch of a confidence interval and test follows this list.)
  • If you would like more practice similar to the workshop sheets, there are exercises at the end of each chapter of Modern Mathematical Statistics with Applications; the odd-numbered exercises have answers.
  • For extra practice, some of the quizzes at https://www.cliffsnotes.com/study-guides/statistics/statistics-quizzes/statistics-quizzes may be helpful, though you may find some things that are not on our syllabus, and the terminology may differ slightly from ours.
  • We will be monitoring Piazza regularly up until the class test.
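If a worked example helps with the workshop topics above, here is a minimal Python sketch of a 95% confidence interval for a mean and a simple two-sample hypothesis test, using scipy (the sample numbers are made up for illustration):

    import numpy as np
    from scipy import stats

    # Hypothetical sample of measurements.
    sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0])

    # 95% confidence interval for the mean, based on the t distribution.
    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    print(stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem))

    # Two-sample t-test, e.g. for a simple A/B comparison.
    a = np.array([4.2, 5.1, 4.8, 5.6])
    b = np.array([5.4, 5.9, 5.7, 6.1])
    print(stats.ttest_ind(a, b))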

Here are our suggestions for revision:

  1. Revise the notes you took in lectures.
  2. Re-read the lecture notes, perhaps taking notes as you go.
  3. Do the sets of comprehension questions available in most topics; solutions are provided for some of the questions.
  4. Look at the summary resource below, which links together the various aspects of the course up to the end of S1 Week 9.
  5. Do the questions from the S2 Week 4 and Week 6 workshop sheets.
  6. Do the mock test on Learn.
  7. Ask questions on Piazza, and do try to answer each other's questions, which is a valuable way of revising. We will leave posts unanswered for around a day to allow time for student answers to develop, and then either endorse or expand on them.

Summary resource: overview of FDS S1 Weeks 1 to 9

This table relates the content we've covered in S1 Weeks 1 to 9 to the data science process introduced in the Week 1 lectures. We've organised the material into three aspects: the methods, the ethics, and the decisions that you need to make as a data scientist; for each stage of the process below, these three aspects are listed in turn. The table is not definitive and it won't be examined; it's just one way of looking at the material.

Ask an interesting question

Methods:
  • Curiosity

Ethics:
  • Will answering the question affect individuals or groups positively (e.g. helping to predict extreme weather events) or adversely (e.g. Facebook and emotion analysis)?

Decisions:
  • What's the best way of formulating the question?
Get the data

Methods:
  • Downloads of tables
  • APIs, RSS
  • Web scraping
  • Surveying
  • Check for missing or duplicated data (with pandas)
  • Merge tables (pandas; a short pandas sketch follows this stage)

Ethics:
  • Is this being done legally (e.g. within frameworks such as GDPR)?
  • Have subjects given their consent? (e.g. the OkCupid data release)
  • Are we obtaining the data in accordance with terms and conditions, and not hogging bandwidth?
  • Is the data a representative sample that will give us as unbiased an answer to the question as possible?
  • Can subjects be re-identified?

Decisions:
  • Where to search for the data?
  • Is the license suitable?
  • Is the available data sufficient, or do we need to collect more data?
  • What can we tell about the data quality from its description?
  • Is there potential measurement or selection bias? (See ethics too.)
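As a minimal illustration of the pandas checks and merge mentioned under Methods above (the file names and column names are hypothetical):

    import pandas as pd

    # Hypothetical tables; file and column names are made up for illustration.
    measurements = pd.read_csv('measurements.csv')  # subject_id, value
    subjects = pd.read_csv('subjects.csv')          # subject_id, age

    # Check for missing data: count NaNs in each column.
    print(measurements.isna().sum())

    # Check for (and drop) duplicated rows, then drop rows with missing values.
    print(measurements.duplicated().sum())
    clean = measurements.drop_duplicates().dropna()

    # Merge the two tables on their common key.
    merged = pd.merge(clean, subjects, on='subject_id', how='inner')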
Explore the data

Methods:
  • Plotting, e.g. scatter plots (multivariate), histograms (univariate)
  • Univariate summary statistics: mean, median, standard deviation, mode or modes of the distribution (illustrated in the sketch after this stage)
  • Bivariate summary statistics: linear correlation coefficient
  • PCA (to help with visualisation)

Ethics:
  • Are we checking the data integrity so that we'll get an accurate answer?
  • Are our plotting decisions giving a full view of the data?

Decisions:
  • How to merge the data?
  • How to clean the data?
  • What plot to use?
  • What summary statistics to use?
  • How many PCA dimensions to retain?
  • How to aggregate the data (e.g. group by day, week or month)?
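A minimal sketch of the summary statistics and of PCA (the random data here are a stand-in; PCA is done with scikit-learn, one common choice):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical numeric dataset.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list('abcd'))

    # Univariate summary statistics.
    print(df['a'].mean(), df['a'].median(), df['a'].std())
    print(df['a'].round(1).mode())  # mode(s) of the (rounded) values

    # Bivariate: linear (Pearson) correlation coefficient.
    print(df['a'].corr(df['b']))

    # PCA after standardising; the explained variance ratio helps decide
    # how many dimensions to retain.
    X = StandardScaler().fit_transform(df)
    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)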
Model the data

Methods:
  • Prediction (supervised learning):
    • Linear regression
    • Multiple linear regression
    • Classification with k-NN (and with logistic regression in S2)
  • Evaluation (see the sketch after this stage):
    • Visual diagnostics
    • Numerical diagnostics: RMSE, R² (regression)
    • Classification accuracy
    • Variance explained (PCA)

Ethics:
  • Are the results fair and unbiased? (e.g. recruitment, COMPAS)
  • Have we ignored any lurking/confounding variables?
  • Are we applying methods correctly?

Decisions:
  • Which method(s) to use?
  • Which variables to include?
  • Do we need to apply variable transforms?
  • Are the diagnostics acceptable?
  • Does the answer make sense?
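A minimal regression-and-evaluation sketch in scikit-learn (the synthetic data and split are illustrative assumptions, not part of the course materials):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Synthetic data: y depends linearly on two variables, plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    # Hold out a test set so that we report test error, not training error.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Numerical diagnostics on the held-out test set: RMSE and R².
    print(np.sqrt(mean_squared_error(y_test, y_pred)))
    print(r2_score(y_test, y_pred))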
Communicate and visualise the results

Methods:
  • Visualisations, including principles such as:
    • Graphical integrity
    • Data-ink (i.e. information-rich plots)
    • Accessibility: suitable size of text and use of colour
    • No chartjunk
    • Annotations
    • Multiples
  • Tables
  • Written descriptions of the data
  • Storytelling (truthful, of course)

Ethics:
  • Are our visualisations and descriptions truthful, i.e. not exaggerating or hiding effects?
  • Are we reporting test error, not training or validation error?
  • Are the reported metrics appropriate? E.g. accuracy can make performance look good with unbalanced classes.

Decisions:
  • What data to plot?
  • What type of plot to use? E.g. line plots are often helpful for time series, box plots convey distributional information, etc. (A small annotated-plot sketch follows this stage.)
  • Is a table or plot better?
  • What annotations should we use?
  • What's the main story that the data tells?
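A minimal matplotlib sketch of an annotated time-series line plot in the spirit of the principles above (the data, labels and annotation are made up for illustration):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Hypothetical daily time series.
    dates = pd.date_range('2024-01-01', periods=90, freq='D')
    values = np.cumsum(np.random.default_rng(0).normal(size=90))

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(dates, values)

    # Annotate the feature of interest rather than leaving readers to hunt for it.
    peak = values.argmax()
    ax.annotate('peak', xy=(dates[peak], values[peak]),
                xytext=(dates[peak], values[peak] + 1),
                arrowprops={'arrowstyle': '->'})

    # Readable labels, no chartjunk.
    ax.set_xlabel('Date')
    ax.set_ylabel('Value')
    ax.set_title('Hypothetical daily measurement')
    fig.tight_layout()
    plt.show()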
License

All rights reserved, The University of Edinburgh.