INF2-FDS: Revision

Revision resources and advice

Here are resources for revision:

  • Comprehension questions on Learn.
  • Mock test on Learn in Revision folder.
  • The workshop sheets and solutions from S2 Weeks 4 and 6, which have exercises on sampling, confidence intervals, logistic regression, A/B testing and hypothesis testing. (A small worked sketch of a confidence interval and test follows this list.)
  • If you would like more practice similar to the workshop sheets, there are exercises at the end of each chapter of Modern Mathematical Statistics with Applications; the odd-numbered exercises have answers.
  • For extra practice, some of the quizzes at https://www.cliffsnotes.com/study-guides/statistics/statistics-quizzes/statistics-quizzes may be helpful, though you may find some things that are not on our syllabus, and the terminology may differ slightly from ours.
  • We will be monitoring Piazza regularly up until the class test.
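If a worked example helps with the workshop topics above, here is a minimal Python sketch of a 95% confidence interval for a mean and a simple two-sample hypothesis test, using scipy (the sample numbers are made up for illustration):

    import numpy as np
    from scipy import stats

    # Hypothetical sample of measurements.
    sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0])

    # 95% confidence interval for the mean, based on the t distribution.
    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    print(stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem))

    # Two-sample t-test, e.g. for a simple A/B comparison.
    a = np.array([4.2, 5.1, 4.8, 5.6])
    b = np.array([5.4, 5.9, 5.7, 6.1])
    print(stats.ttest_ind(a, b))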

Here are our suggestions for revision:

  1. Revise the notes you took in lectures.
  2. Re-read the lecture notes, perhaps taking notes as you go.
  3. Do the sets of comprehension questions available in most topics; solutions are provided for some of the questions.
  4. Look at the summary resource below, which links together the various aspects of the course up to the end of S1 Week 9.
  5. Do the questions from the S2 Week 4 and Week 6 workshop sheets.
  6. Do the mock test on Learn.
  7. Ask questions on Piazza, and do try to answer each other's questions, which is a valuable way of revising. We will leave posts unanswered for around a day to allow time for student answers to develop, and then either endorse or expand on them.

Summary resource: overview of FDS S1 Weeks 1 to 9

This table relates the content we've covered in S1 Weeks 1 to 9 to the data science process introduced in the Week 1 lectures. We've organised the material into three aspects: the methods, the ethics, and the decisions that you need to make as a data scientist; for each stage of the process below, these three aspects are listed in turn. The table is not definitive and it won't be examined; it's just one way of looking at the material.

Ask an interesting question

Methods:
  • Curiosity

Ethics:
  • Will answering the question affect individuals or groups positively (e.g. helping to predict extreme weather events) or adversely (e.g. Facebook and emotion analysis)?

Decisions:
  • What's the best way of formulating the question?
Get the data

Methods:
  • Downloads of tables
  • APIs, RSS
  • Web scraping
  • Surveying
  • Check for missing or duplicated data (with pandas)
  • Merge tables (pandas; a short pandas sketch follows this stage)

Ethics:
  • Is this being done legally (e.g. within frameworks such as GDPR)?
  • Have subjects given their consent? (e.g. the OkCupid data release)
  • Are we obtaining the data in accordance with terms and conditions, and not hogging bandwidth?
  • Is the data a representative sample that will give us as unbiased an answer to the question as possible?
  • Can subjects be re-identified?

Decisions:
  • Where to search for the data?
  • Is the license suitable?
  • Is the available data sufficient, or do we need to collect more data?
  • What can we tell about the data quality from its description?
  • Is there potential measurement or selection bias? (See ethics too.)
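As a minimal illustration of the pandas checks and merge mentioned under Methods above (the file names and column names are hypothetical):

    import pandas as pd

    # Hypothetical tables; file and column names are made up for illustration.
    measurements = pd.read_csv('measurements.csv')  # subject_id, value
    subjects = pd.read_csv('subjects.csv')          # subject_id, age

    # Check for missing data: count NaNs in each column.
    print(measurements.isna().sum())

    # Check for (and drop) duplicated rows, then drop rows with missing values.
    print(measurements.duplicated().sum())
    clean = measurements.drop_duplicates().dropna()

    # Merge the two tables on their common key.
    merged = pd.merge(clean, subjects, on='subject_id', how='inner')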
Explore the data

Methods:
  • Plotting, e.g. scatter plots (multivariate), histograms (univariate)
  • Univariate summary statistics: mean, median, standard deviation, mode or modes of the distribution (illustrated in the sketch after this stage)
  • Bivariate summary statistics: linear correlation coefficient
  • PCA (to help with visualisation)

Ethics:
  • Are we checking the data integrity so that we'll get an accurate answer?
  • Are our plotting decisions giving a full view of the data?

Decisions:
  • How to merge the data?
  • How to clean the data?
  • What plot to use?
  • What summary statistics to use?
  • How many PCA dimensions to retain?
  • How to aggregate the data (e.g. group by day, week or month)?
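A minimal sketch of the summary statistics and of PCA (the random data here are a stand-in; PCA is done with scikit-learn, one common choice):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical numeric dataset.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list('abcd'))

    # Univariate summary statistics.
    print(df['a'].mean(), df['a'].median(), df['a'].std())
    print(df['a'].round(1).mode())  # mode(s) of the (rounded) values

    # Bivariate: linear (Pearson) correlation coefficient.
    print(df['a'].corr(df['b']))

    # PCA after standardising; the explained variance ratio helps decide
    # how many dimensions to retain.
    X = StandardScaler().fit_transform(df)
    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)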
Model the data

Methods:
  • Prediction (supervised learning):
    • Linear regression
    • Multiple linear regression
    • Classification with k-NN (and with logistic regression in S2)
  • Evaluation (see the sketch after this stage):
    • Visual diagnostics
    • Numerical diagnostics: RMSE, R² (regression)
    • Classification accuracy
    • Variance explained (PCA)

Ethics:
  • Are the results fair and unbiased? (e.g. recruitment, COMPAS)
  • Have we ignored any lurking/confounding variables?
  • Are we applying methods correctly?

Decisions:
  • Which method(s) to use?
  • Which variables to include?
  • Do we need to apply variable transforms?
  • Are the diagnostics acceptable?
  • Does the answer make sense?
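A minimal regression-and-evaluation sketch in scikit-learn (the synthetic data and split are illustrative assumptions, not part of the course materials):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Synthetic data: y depends linearly on two variables, plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    # Hold out a test set so that we report test error, not training error.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Numerical diagnostics on the held-out test set: RMSE and R².
    print(np.sqrt(mean_squared_error(y_test, y_pred)))
    print(r2_score(y_test, y_pred))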
Communicate and visualise the results

Methods:
  • Visualisations, including principles such as:
    • Graphical integrity
    • Data-ink (i.e. information-rich plots)
    • Accessibility: suitable size of text and use of colour
    • No chartjunk
    • Annotations
    • Multiples
  • Tables
  • Written descriptions of the data
  • Storytelling (truthful, of course)

Ethics:
  • Are our visualisations and descriptions truthful, i.e. not exaggerating or hiding effects?
  • Are we reporting test error, not training or validation error?
  • Are the reported metrics appropriate? E.g. accuracy can make performance look good with unbalanced classes.

Decisions:
  • What data to plot?
  • What type of plot to use? E.g. line plots are often helpful for time series, box plots convey distributional information, etc. (A small annotated-plot sketch follows this stage.)
  • Is a table or plot better?
  • What annotations should we use?
  • What's the main story that the data tells?
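A minimal matplotlib sketch of an annotated time-series line plot in the spirit of the principles above (the data, labels and annotation are made up for illustration):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Hypothetical daily time series.
    dates = pd.date_range('2024-01-01', periods=90, freq='D')
    values = np.cumsum(np.random.default_rng(0).normal(size=90))

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(dates, values)

    # Annotate the feature of interest rather than leaving readers to hunt for it.
    peak = values.argmax()
    ax.annotate('peak', xy=(dates[peak], values[peak]),
                xytext=(dates[peak], values[peak] + 1),
                arrowprops={'arrowstyle': '->'})

    # Readable labels, no chartjunk.
    ax.set_xlabel('Date')
    ax.set_ylabel('Value')
    ax.set_title('Hypothetical daily measurement')
    fig.tight_layout()
    plt.show()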
License

All rights reserved, The University of Edinburgh.