Data Science Prerequisites: Programming (5/5) #HumanData
Knowing various fancy algorithms and statistical concepts is good, but the ability to apply them to solve practical problems is even better. In this day and age, with the amounts of data available and the ever-growing complexity of techniques being used, it is imperative to off-load the computations to a piece of hardware designed for that purpose. But in order to communicate our intent to the machine, we need to be well versed in a programming language.
There are many languages out there, each developed with a specific goal in mind and therefore better suited to a certain class of tasks. In the data science space, practitioners seem to favor R and Python in particular, mainly for their flexibility, ease of use, and the specialized libraries or packages they provide for data analytics work. This is a key point motivating the choice, as these two languages offer off-the-shelf implementations of various techniques and algorithms that can be used with a simple function call and a few lines of code.
This helps the practitioner focus on solving the task at hand rather than writing pages of code. There is a long-running debate in the community about which language is better, and arguments can be made for both options. R is a statistical language and is generally better suited for data analysis, insight extraction, visualization, interactive prototyping and communication. Python, on the other hand, as a full-fledged programming language, is the preferred choice for writing production-level code that can be integrated into an existing code base.
Although Python doesn’t have data science functionality built in, it offers many highly specialized, well-conceived libraries for data science work. At the end of the day, the answer to the debate is “it doesn’t matter”: the programming language is but a means to an end. That said, Python, being the industry standard, will be the language of choice for the present post. It is also worth noting that, apart from the libraries, which are language-dependent, most of the concepts introduced here are foundational knowledge in computer programming and generalize equally well to other programming languages.
What is Python?
Python is an interpreted, high-level, general-purpose programming language [Wikipedia]. Its flexibility allows it to be used as a scripting tool for interactive analyses and fast prototyping, but also as a full-fledged programming language for developing programs in other areas such as networking, web development, gaming, etc. Before delving into the specialized data science libraries Python has to offer, it is crucial to take the time to understand the base Python distribution first, along with general programming principles. This makes picking up new libraries much easier when the time comes. The core topics to cover are:
- Data Types: Integers, Floats, Strings, Booleans …
- Data Structures: Lists, Tuples, Sets, Dictionaries, their similarities, their differences, their use-cases and how to manipulate them.
- Control Flow: If statements, For Loops, While Loops, Break/Pass/Continue statements…
- Functions: Function definition, arguments, keyword arguments, lambda expressions…
- Input / Output (I/O): Reading/writing from/to files and output formatting.
- Errors and Exceptions: How to catch errors and handle exceptions.
- Classes (Object-Oriented Programming): How to define and manipulate classes in Python. This one is very important because most data science packages in Python are organized in classes. In sklearn for example, LinearRegression is a class, and LinearRegression() creates an object instance of that class, with methods for functionality and attributes for storing values.
- Other Modules: These are complementary modules included in the Python Standard Library and can be handy at times. They include the math module (for mathematical functions), the datetime module (for date and time manipulation), the re module (for regular expressions), the collections module (which offers additional containers and data structures better suited for some tasks) …
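To make the list above more concrete, here is a minimal sketch that touches most of these topics in a few lines, using only the standard library. The MeanPredictor class, the describe function and all the sample values are hypothetical and made up for illustration. The class only imitates the fit/predict pattern described for sklearn; it is not part of any real package.

```python
from collections import Counter
from datetime import date
import math
import re


class MeanPredictor:
    """Toy, hypothetical estimator imitating the fit/predict pattern."""

    def __init__(self, default=0.0):
        self.default = default  # attribute set when the object is created
        self.mean_ = None       # attribute filled in by fit()

    def fit(self, values):
        # Raise an exception on invalid input instead of failing silently.
        if not values:
            raise ValueError("cannot fit on an empty list")
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, n=1):
        value = self.default if self.mean_ is None else self.mean_
        return [value] * n


# Data structures
scores = [3.0, 4.5, 6.0]                 # list: ordered, mutable
point = (1, 2)                           # tuple: ordered, immutable
unique_tags = {"r", "python", "python"}  # set: duplicates collapse
ages = {"ada": 36, "alan": 41}           # dictionary: key -> value


# Function with a keyword argument, a for loop and an if statement
def describe(values, threshold=4.0):
    labels = []
    for v in values:
        if v >= threshold:
            labels.append("high")
        else:
            labels.append("low")
    return labels


labels = describe(scores, threshold=4.5)
# Lambda expression used as a sort key (oldest first)
by_age = sorted(ages, key=lambda name: ages[name], reverse=True)

# Errors and exceptions: catch the error raised by fit()
try:
    MeanPredictor().fit([])
except ValueError as err:
    message = str(err)

# Classes: create an instance, call its methods, read its attributes
model = MeanPredictor().fit(scores)
predictions = model.predict(n=2)
summary = f"mean = {model.mean_:.2f}"    # output formatting

# Other standard library modules
label_counts = Counter(labels)                               # collections
numbers = re.findall(r"\d+", "orders: 12 shipped, 7 pending")  # re
day_gap = (date(2024, 1, 31) - date(2024, 1, 1)).days        # datetime
root = math.sqrt(16)                                         # math

print(labels, by_age, predictions, summary)
print(label_counts, numbers, day_gap, root)
```

Typing a snippet like this into a terminal and changing one line at a time is a quick way to check your understanding of each topic before moving on to the specialized libraries.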
Going further
To improve your skills, or simply to start learning programming, here are some resources, grouped by type of source.
Channel: Corey Schafer
Playlist: Python Tutorials
Author: Allen Downey
Title: Think Python
Description: Allen Downey’s books have come up several times in this series, so needless to say, I believe they are of high quality. This is a 250-page book introducing the reader to the fundamentals of computer science and specifically the Python programming language, including most of the concepts listed above. However, I do not recommend doing the exercises in the book because I personally feel they are too time-consuming for what they teach. On the plus side, there are many coding platforms where one can practice Python. The book is freely available online.
Author: Official Python Documentation
Title: Python 3 Tutorial
Description: This is not exactly a book, but a complete tutorial offered as part of the official Python documentation. It is probably the best resource to cover all the topics mentioned above, and more if needed. Feel free to choose either this or Think Python, because they contain similar information and are organized in a similar way. That said, the teaching style is not the same, so pick whichever works best for you.
Name: Python Practice Challenges
Description: Reading books and tutorials about a programming language is a good first step, but more importantly, you need practice. While going through a book or tutorial, keep a terminal or an IDE open on the side and practice as you read. HackerRank is a platform offering many exercises organized in sections, which you can solve in Python. It’s a great place to practice and has the benefit of being beginner-friendly. Many practitioners highly recommend it.