Data Science Prerequisites: Programming (5/5)


Knowing various fancy algorithms and statistical concepts is good, but the ability to apply them to solve practical problems is even better. In this day and age, with the amounts of data available and the ever-growing complexity of techniques being used, it is imperative to off-load the computations to a piece of hardware designed for that purpose. But in order to communicate our intent to the machine, we need to be well versed in a programming language. 

There are many languages out there, each of which was developed with a specific goal in mind and is, therefore, better suited for a certain class of tasks. In the data science space, practitioners seem to favor R and Python in particular, mainly due to their flexibility, ease of use, and the specialized libraries or packages they provide to carry out data analytics work. This is a key point motivating the choice, as these two languages offer off-the-shelf implementations of various techniques and algorithms that can be used with a simple function call and a few lines of code.
This essentially helps the practitioner focus on solving the task at hand rather than writing pages of code. There is an ongoing debate in the community about which language is better, and arguments can be made for both options. R is a statistical language and is generally better suited for data analysis, insight extraction, visualization, interactive prototyping and communication. Python, on the other hand, as a full-fledged programming language, is the preferred choice for writing production-level code that can be integrated into an existing code base.
Although it doesn’t have data science functionality built in, it offers many highly specialized, well-designed libraries for data science work. At the end of the day, the answer to the debate is “it doesn’t matter”: the programming language is but a means to an end. That said, Python, being the industry standard, will be the language of choice for this post. It is also worth noting that, apart from the libraries, which are language-dependent, most of the concepts introduced here are foundational programming knowledge and generalize well to other programming languages.

What is Python?

Python is an interpreted, high-level, general-purpose programming language [Wikipedia]. Its flexibility allows it to be used as a scripting tool for interactive analyses and fast prototyping, but also as a full-fledged programming language to develop programs destined for other tasks such as networking, web development, gaming, etc. Before delving into the specialized data science libraries Python has to offer, it is crucial to take the time to understand the base Python distribution first, along with general programming principles, which will subsequently make the process of picking up new libraries much easier when the time comes.

A beginner-to-intermediate level of programming knowledge is desirable for data science work.

Programming Concepts

  1. Data Types: Integers, Floats, Strings, Booleans … 
  1. Data Structures: Lists, Tuples, Sets, Dictionaries, their similarities, their differences, their use-cases and how to manipulate them. 
  1. Control Flow: If statements, For Loops, While Loops, Break/Pass/Continue statements…
  1. Functions: Function definition, arguments, keyword arguments, lambda expressions…
  1. Input / Output (I/O): Reading/writing from/to files and output formatting.
  1. Errors and Exceptions: How to catch errors and handle exceptions.
  1. Classes (Object-Oriented Programming): How to define and manipulate classes in Python. This one is very important because most data science packages in Python are organized in classes. In sklearn for example, LinearRegression is a class, and LinearRegression() creates an object instance of that class, with methods for functionality and attributes for storing values.
  1. Other Modules: These are complementary modules included in the Python Standard Library and can be handy at times. They include the math module (for mathematical functions), the datetime module (for date and time manipulation), the re module (for regular expressions), the collections module (which offers additional containers and data structures better suited for some tasks) …
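
To make several of these concepts concrete, here is a minimal sketch that combines a dictionary, a for loop, exception handling, the collections module and a small class. It uses only the standard library. The names parse_scores and SimpleLinearRegression are hypothetical and chosen for illustration; the class only imitates the fit/predict pattern described above and is not sklearn's implementation.

```python
from collections import Counter


def parse_scores(lines):
    """Build a {name: score} dictionary from 'name,score' strings."""
    scores = {}
    for line in lines:
        try:
            name, raw_score = line.split(",")
            scores[name.strip()] = float(raw_score)
        except ValueError:
            # Wrong number of fields or a non-numeric score: skip the row.
            continue
    return scores


class SimpleLinearRegression:
    """Hypothetical one-feature least-squares model (not sklearn's class)."""

    def __init__(self):
        # Attributes store the values learned by fit().
        self.slope_ = None
        self.intercept_ = None

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sxx = sum((a - mean_x) ** 2 for a in x)
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self  # returning self allows Model().fit(...) chaining

    def predict(self, x):
        return [self.intercept_ + self.slope_ * a for a in x]


rows = ["ana, 3.5", "bob, n/a", "cy, 4.0", "broken row"]
scores = parse_scores(rows)  # {'ana': 3.5, 'cy': 4.0}

# Counter is a dictionary subclass from collections that counts items.
word_counts = Counter("to be or not to be".split())

# Calling the class creates an object; fit() fills in its attributes.
model = SimpleLinearRegression().fit([1, 2, 3, 4], [3, 5, 7, 9])
prediction = model.predict([5])  # [11.0]
```

The two malformed rows are skipped because both a failed unpacking and a failed float conversion raise ValueError. The training data lies exactly on the line y = 2x + 1, so the fitted slope and intercept are 2.0 and 1.0.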

Going Further

To improve your skills, or simply to start learning programming, here are some resources, grouped by type of source.

YouTube Playlists

Channel: Corey Schafer

Playlist: Python Tutorials

Description: This channel is focused on creating tutorials and walkthroughs for software developers and engineers and covers a variety of topics for different skill levels, such as Git, Python, terminal commands, SQL, JavaScript, etc. The Python tutorials playlist is a very comprehensive collection of 143 videos, each 10 to 45 minutes long. It covers general programming principles and concepts in Python, including a few specialized packages, some of which are essential to data science (pandas and matplotlib). Corey has an outstanding teaching style and conveys complex information in a digestible fashion through clear communication and plenty of illustrative examples. Whether you are a beginner or an advanced programmer, this channel probably has something new to teach you.


Books

Author: Allen Downey

Title: Think Python

Description: I have mentioned Allen Downey’s books several times in this series, so needless to say, I believe they are of high quality. This is a 250-page book introducing the reader to the fundamentals of computer science and specifically the Python programming language, including most of the concepts listed above. However, I do not recommend doing the exercises in the book because I personally feel they are too time-consuming for what they teach. Fortunately, there are many coding platforms where one can practice Python instead. The book is freely available online.

Author: Official Python Documentation

Title: Python 3 Tutorial

Description: This is not exactly a book, but a complete tutorial offered as part of the official Python documentation. It is probably the best resource to cover all the topics mentioned above, and more if needed. Feel free to choose either this or Think Python, because they contain similar information and are organized in a similar way. That said, the teaching style is not the same, so pick whichever works best for you.

Coding Platforms

Platform: HackerRank

Name: Python Practice Challenges

Description: Reading books and tutorials about a programming language is a good first step, but more importantly, you need practice. While going through a book or tutorial, have your terminal or an IDE open on the side, and practice while reading. HackerRank is a platform offering many exercises organized in sections, which you can solve in Python. It’s a great place to practice and has the benefit of being beginner-friendly. Many practitioners highly recommend it.

Written by Samy Tafasca

Curious, eager to learn and a good communicator, Samy is a PhD student in Deep Learning within Octopeek’s Innovation division. Before starting his career in research, he obtained an engineering degree from a French engineering school and a master’s degree in data science from a London university. His various experiences allowed him to develop a solid technical profile with strong international exposure. He often uses his knowledge to write blog posts or initiate knowledge-sharing efforts with the community. In addition, Samy is also passionate about cooking, travel, technology and vector artwork!