Data Science Prerequisites: statistics (2/5)

Statistics plays a central role in data science: it allows the practitioner to draw conclusions about the characteristics of a population by analyzing a sample of it. In practice, that population is usually related to the products and/or clients of the company.

Data Science has become a very general term, with each party adopting its own definition according to its context and business objectives. As the data job market continues to mature, a few key roles with specific requirements and visions are starting to form, all derived from the original field. This trend is paving the way for the specialist profile to shine, eclipsing the data science generalist along the way. One of these variants, called "data scientist, product analytics" or, equivalently, "type A" data scientist (A for analysis), is by far the most prevalent. This profile resembles that of a statistician who employs more sophisticated tools and has extensive experience dealing with relatively large and messy datasets.

In this blog post, we will go over what statistics is, the important role it plays in data science, the fundamental concepts that need to be learned, and lastly, we will provide a curated list of resources to cover them.

How to define statistics?

Statistics is the branch of mathematics that deals with the collection, organization, analysis and interpretation of data. It is particularly useful when we need to draw conclusions about a population by analyzing a finite sample from it. Statistics and probability are two related but separate academic disciplines: statistical analysis often relies on probability distributions and theorems, so the two subjects are usually studied together. Together, they appear almost everywhere in data science, from data analysis and visualization to machine learning and statistical modeling.
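To make the idea of learning about a population from a sample concrete, here is a minimal sketch using only Python's standard library. The population is simulated (the order values, the sample size of 500 and the function name sample_mean_with_error are all assumptions made for illustration), and the 95% interval relies on the usual normal approximation.

```python
import random
import statistics


def sample_mean_with_error(population, n, seed=0):
    """Draw a simple random sample of size n; return (sample mean, standard error)."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(sample) / n ** 0.5
    return mean, se


# Hypothetical population: 100,000 simulated order values (made-up numbers)
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

true_mean = statistics.mean(population)
est_mean, est_se = sample_mean_with_error(population, n=500)

print(f"population mean: {true_mean:.2f}")
print(f"sample estimate: {est_mean:.2f} +/- {1.96 * est_se:.2f} (approx. 95% interval)")
```

In a real project the full population is rarely available, which is exactly why the standard error matters: it tells us how far the sample estimate is likely to be from the value we cannot observe.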

Intermediate to advanced knowledge of statistics is desirable for data science work.

Statistics Concepts

  1. Descriptive Statistics: Central Tendency (Mean, Median, Mode), Dispersion (Variance, Standard Deviation, Percentiles, Interquartile Range), Shape (Skewness, Kurtosis), Dependence (Pearson’s Correlation, Spearman’s Correlation, Kendall’s Correlation), Graphics (Bar Chart, Box Plot, Histogram, Scatter Plot, …).
  2. Statistical Inference: Statistical Theory (Population, Sample, Statistic, Sampling Distribution, Density Estimation), Frequentist Inference (Point Estimation, Confidence Intervals, Hypothesis Testing), Bayesian Inference (Bayesian Probability, Credible Intervals, Bayesian Estimators).
  3. Regression Analysis: Linear Regression (Ordinary Least Squares, General Linear Model, Bayesian Regression), Analysis (Errors and Residuals, Q-Q Plots, Validation, Interpretation), Generalized Linear Models (Exponential Family).
  4. Experimental Design
  5. Biases & Paradoxes: Simpson’s Paradox, Freedman’s Paradox, Survivorship Bias, Gambler’s Fallacy, Selection Bias, …
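As a small taste of the first item in the list, the sketch below computes a few descriptive statistics with Python's standard library and writes Pearson's correlation and skewness from scratch. The data are made up for illustration, and the helper names pearson and skewness are our own, not part of any library.

```python
import statistics


def pearson(xs, ys):
    """Pearson's correlation coefficient, written from scratch."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def skewness(xs):
    """Moment-based (population) skewness: m3 / m2^1.5."""
    m = statistics.mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Made-up data for illustration
values = [2, 4, 4, 5, 7, 9, 11]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion (sample variance and standard deviation, quartiles, IQR)
variance = statistics.variance(values)
stdev = statistics.stdev(values)
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Shape and dependence
skew = skewness(values)
r = pearson([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance}, stdev={stdev:.3f}, IQR={iqr}")
print(f"skewness={skew:.3f}, pearson r={r:.3f}")
```

Libraries such as NumPy, SciPy and pandas provide all of these out of the box; writing them by hand once is mainly a way to check that the definitions are understood.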

Going further

To help you improve your skills, or simply start learning statistics, we suggest a few resources, grouped by type of source.

YouTube Playlists

Channel: Khan Academy

Playlist: Statistics

Description: Another playlist from Khan Academy, offering 67 videos of ~10 minutes each. The playlist provides a solid foundation in statistics (both descriptive and inferential). It will teach you most of what you need to know about statistics to get started with data science. We highly recommend complementing it with practical exercises to help you retain the material.


Books

Author: Allen Downey

Title: Think Stats

Description: A very good introductory book on applied statistics using Python. For each concept, the author provides an intuition along with an explanation of why it is useful, then outlines the mathematical formulation and turns it into Python code (from scratch). The book is applied and gives you a working knowledge of statistics without heavy proofs and equations. It is also highly regarded by the data science community. The PDF version of the book is freely available online.

Author: Allen Downey

Title: Think Bayes

Description: Another book from the same author, which also comes highly recommended by the community. It introduces the main concepts of Bayesian statistics and hones your Bayesian thinking, which you will need for machine learning. As with Think Stats, the book relies on Python to code everything from scratch. A PDF version is freely available online.

Author: Casella & Berger

Title: Statistical Inference

Description: This is one of the best books on inferential statistics out there, and it is used as a textbook for statistics courses in many universities. However, it is also a very theoretical book, full of mathematical formulations, theorems and proofs. If you’re looking for a sound treatment of statistical inference, you can’t go wrong with this one. Keep in mind, though, that this book is not meant to be read from cover to cover, but rather to be used as a reference. A PDF version is available online.


Online Courses

Institution: Duke University

Name: Statistics with R Specialization

Description: This specialization is hosted by Coursera and taught by Mine Çetinkaya-Rundel, an assistant professor at the Department of Statistical Science at Duke University. It includes 5 courses (Intro to Probability and Data with R, Inferential Statistics, Linear Regression and Modeling, Bayesian Statistics, and Statistics with R Capstone) that cover most of the concepts presented above. The specialization has a 4.5/5.0 rating across 8665 participants (so far).

In a future post, we will address the third prerequisite: linear algebra.