Data Science Prerequisites: Probability (1/5)

Probability forms the backbone of many data science techniques, which makes it an essential prerequisite for anyone who wants to dive deep into this field.

Data Science is an interdisciplinary and pervasive field that aims to turn data into useful insights. As such, it draws inspiration from different domains and scientific disciplines, and sits at the intersection of applied mathematics, artificial intelligence, information retrieval, databases and many others.

In this blog post, we will go over what probability is, the important role it plays in data science, and the fundamental concepts that need to be learned. Lastly, we will present a curated list of resources that cover them.

How to define probability?

Probability theory is the branch of mathematics dedicated to the study of random phenomena. It assigns each event a number between 0 and 1 that quantifies how likely the event is to occur; in practice, this number can often be estimated through repeated random sampling. Probability allows us to reason effectively in situations where certainty is impossible. It is heavily used in statistics and also forms the basis of many machine learning algorithms (e.g., probabilistic graphical models) and other data science techniques.
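To make the idea of estimating a probability through repeated random sampling concrete, here is a minimal sketch. The experiment (rolling two fair dice and checking whether they sum to 7) and the names `estimate_probability` and `sum_is_seven` are our own illustrative choices, not part of any library.

```python
import random
from fractions import Fraction


def estimate_probability(event, trials=100_000, seed=0):
    """Estimate P(event) as the fraction of simulated trials in which it occurs."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        # One random experiment: roll two fair six-sided dice.
        roll = (rng.randint(1, 6), rng.randint(1, 6))
        if event(roll):
            hits += 1
    return hits / trials


def sum_is_seven(roll):
    """Event of interest (illustrative): the two dice sum to 7."""
    return roll[0] + roll[1] == 7


# Exact value by counting: 6 favourable outcomes out of 36 equally likely ones.
exact = Fraction(6, 36)
estimate = estimate_probability(sum_is_seven)
print(f"exact = {float(exact):.4f}, estimate = {estimate:.4f}")
```

With enough trials the estimate should land close to the exact value of 1/6, and increasing `trials` should tighten the agreement.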

A beginner-to-intermediate level of probability knowledge is expected for data science work.

Probability Concepts

  1. Foundations: Counting, Probability Axioms, Events and Sample Spaces.
  2. Conditional Probability: Conditional Probability, Independence, Bayes' Rule.
  3. Random Variables: Discrete/Continuous Random Variables, Expected Value, Variance, Higher-Order Moments, Cumulative Distribution Function (CDF), Probability Density/Mass Function (PDF, PMF), Important Probability Distributions (Bernoulli, Binomial, Poisson, Exponential, Uniform, Normal), Joint Distributions.
  4. Limit Theorems: Weak Law of Large Numbers, Strong Law of Large Numbers, Central Limit Theorem (CLT).
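As a small taste of two of the concepts listed above, the sketch below applies Bayes' rule to a screening test and then checks the Central Limit Theorem by simulation. All numbers (1% prevalence, 95% sensitivity, 5% false-positive rate, samples of size 30) and the helper name `bayes_posterior` are hypothetical values chosen for illustration only.

```python
import math
import random
import statistics


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H) and P(E | not H)."""
    # Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false-positive rate.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {posterior:.3f}")

# Central Limit Theorem: look at the means of n Uniform(0, 1) draws.
rng = random.Random(0)  # fixed seed so the run is reproducible
n, repeats = 30, 5_000
means = [statistics.fmean([rng.random() for _ in range(n)]) for _ in range(repeats)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
# Uniform(0, 1) has variance 1/12, so a mean of n draws has sd sqrt(1 / (12 n)).
predicted_sd = math.sqrt(1 / (12 * n))
# If the means are roughly normal, about 68% fall within one sd of 0.5.
within_one_sd = sum(abs(m - 0.5) <= predicted_sd for m in means) / repeats

print(f"mean of sample means = {mean_of_means:.3f} (theory: 0.500)")
print(f"sd of sample means   = {sd_of_means:.4f} (theory: {predicted_sd:.4f})")
print(f"share within one sd  = {within_one_sd:.3f} (normal: about 0.683)")
```

Under these assumed numbers, a positive result implies only about a 16% chance of having the condition, because the condition is rare to begin with. The second half suggests why the normal distribution appears so often: averages of independent draws look approximately normal even when the underlying distribution is uniform.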

Going further

Whether you want to improve your skills or are just starting to learn probability, here are some resources, grouped by type.

YouTube Playlists

Channel: mathematicalmonk

Playlist: Probability Primer

Description: This channel is dedicated to postgraduate and upper-level undergraduate mathematics. Probability Primer is a playlist of 43 short videos that cover most of the concepts outlined above. You may want to watch the first few videos to see whether the instructor's way of explaining things suits you.

Channel: Khan Academy

Playlist: Probability

Description: Khan Academy is a non-profit organization with the mission of providing world-class education everywhere and free of charge. It has both a YouTube channel and a website. Probability is a playlist of 41 short videos that provides a good introduction to probability and combinatorics. It is highly recommended by the community.

Books


Author: Sheldon Ross

Title: A First Course in Probability

Description: This is the go-to textbook in most graduate probability courses. It is a very comprehensive book with many exercises, some of which are relatively advanced and require some mathematical intuition. There is also a PDF summary of the book which can be used as a probability refresher. It is highly recommended by the data science community, especially for people with a mathematics background. A PDF version of the book can be found online.

Author: Kevin Murphy

Title: Machine Learning: A Probabilistic Perspective

Description: This is a very popular machine learning book, and one of the recommended textbooks in many graduate ML programs. Murphy starts by introducing most of the probability concepts you need to know for ML in about 30 pages. This makes it another great summary, one that is specifically tailored to machine learning. A PDF version of the book is available online.

Online Courses


Institution: Massachusetts Institute of Technology (MIT)

Name: Introduction to Probability — The Science of Uncertainty

Description: This course is hosted by edX and taught by John Tsitsiklis and Patrick Jaillet, both of whom are professors in the Department of Electrical Engineering and Computer Science at MIT. Consider this course if you want to dive deeper into the world of probability. It is an excellent resource with a lot of positive reviews and is inspired by an actual MIT university course in probability. However, it is somewhat challenging and much longer than the average MOOC. It also covers more than what a data science beginner needs to start the journey. The course requires an estimated 16 weeks of work at 12 hours per week.

In a future post, we will address the second prerequisite: statistics.

Written by Samy Tafasca
Curious, eager to learn and a good communicator, Samy is a PhD student in Deep Learning within Octopeek’s Innovation division. Before starting his career in research, he obtained an engineering degree from a French engineering school and a master’s degree in data science from a London University. His various experiences allowed him to develop a solid technical profile with a strong international exposure. He often uses his knowledge to write blog posts, or initiate knowledge sharing efforts with the community. In addition, Samy is also passionate about cooking, travel, technology and vector artwork!