Retail: why is Big Data infrastructure an absolute must?

Big Data. I’m sure you’ve heard it mentioned in tech circles or business conventions the world over. Heard it mentioned in reference to Google or Microsoft or Amazon. The name is self-explanatory, to a degree. But understanding it refers to a large quantity of data doesn’t paint the whole picture. 

And what’s more infuriating, perhaps, is the fact that such a simplistic name can make it intimidating to ask the proper questions, for fear of seeming uninformed in an ever more tech-centric society. But if you don’t ask the right questions, you risk being left behind!

What is Big Data infrastructure?

To understand what Big Data infrastructure is, it is perhaps easier to take a look at its predecessors, to understand the evolution in technology and what makes this such an important paradigm shift in the way businesses are run.

Early infrastructure and the information age

As personal computers began to make their mark on society, companies shifted from paper files, logged in dusty archives and rusty filing cabinets, to floppy disks and hard drives. Digital storage replaced physical space, and came at just as much of a premium. Year after year, hard drives evolved at an ever-increasing pace, and just as well: databases of customer information, invoices, sales receipts, and purchase orders kept compounding, requiring more storage. And so companies had to purchase bigger towers, filling the role of private servers, and more hard drives to store this information.

And then, the biggest paradigm shift in modern history occurred: the advent of the Internet and the democratization of information. In 2020, more than 300 billion e-mails were sent and received across the world every day, millions of videos were streamed on platforms such as YouTube, and the amount of available information exploded to an unfathomable degree. No single server can process the sheer volume of data that makes its way in ones and zeroes across the fiber optic web that has tied the remotest corners of our world together.

The shift to Big Data and server farms

Thus, rather than scaling up to a more powerful system, pioneers and early adopters began scaling out, adding a new, equivalent system. Rather than add another hard drive, it became more cost-effective, and frankly necessary, to add a new tower and to network these computers together to share the burden of processing and storing information.

And today, Big Data has become an extension of this shift. Big Data infrastructure, at its core, is nothing more than a series of interconnected servers that allow for the storage and processing of an incredible amount of information. 

What are the elements that comprise Big Data infrastructure?

1. Regrouping your company’s internal data

Your business is a rich source of customer information. Unfortunately, these various pieces of data have historically been stored in separate data banks, known as “information silos”. These silos separate various services and departments in your company, from marketing, to sales and after sales service. Each department gathers information and stores it within its own silo. Thus, the first step is to gather all that information in just one place, where it can be analyzed and put to use, and this is already a great boon for your company. 
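To make the idea concrete, here is a toy sketch of breaking down those silos in plain Python. The department names, customer IDs, and fields are invented for illustration: each department keeps its own records, and we merge them into one unified customer view.

```python
# Each "silo" is a department's own store of records, keyed by customer id.
# (Illustrative data only.)
marketing = {101: {"segment": "loyal"}, 102: {"segment": "new"}}
sales = {101: {"last_order": "2021-03-02"}, 103: {"last_order": "2021-01-15"}}
support = {102: {"open_tickets": 1}}

def merge_silos(*silos):
    """Combine per-department records into one view per customer."""
    unified = {}
    for silo in silos:
        for customer_id, record in silo.items():
            unified.setdefault(customer_id, {}).update(record)
    return unified

customers = merge_silos(marketing, sales, support)
# customers[101] now holds marketing AND sales information in one place.
```

In a real infrastructure the "merge" happens at data-lake scale rather than in memory, but the principle is the same: one place, one view, ready for analysis.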

All the information you gathered will be centralized in what we refer to as a “data lake”. Simply put, this is a repository where all sorts of formatted and raw data are stored. The name alone is wonderfully descriptive, because it conjures the image of an immense body of water in which each piece of data is like a droplet. Some of the water is crystal clear, and you can imagine it as structured data used for analytics. However, some of it grows murky as you head further down, and this can be thought of as your raw, binary data, from images or audio files, for example.

2. Filling the reservoir

The most successful companies have realized that this information, as rich as it may be, is insufficient. So, in order to create a thriving ecosystem, they must collect more data; often this means web scraping, a process in which publicly available information on the Internet is collected and stored, whether structured open-source data or raw data from any number of websites. This third-party data will be saved and stored within your data lake in order to really benefit from all Big Data has to offer.
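The "scraping" step itself can be sketched with nothing but the standard library. This example parses product names and prices out of a sample HTML snippet; a real pipeline would fetch live pages (and respect each site's terms and robots.txt), and the CSS class names here are invented for illustration.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Extract product name/price pairs from raw HTML (toy example)."""
    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "product-name" in classes:
            self._field = "name"
        elif "product-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        if self._field == "name":
            self.products.append({"name": data.strip()})
        elif self._field == "price":
            # Normalize "$89.90" into a number the data lake can query.
            self.products[-1]["price"] = float(data.strip().lstrip("$"))
        self._field = None

html = """
<div><span class="product-name">Leather boots</span>
<span class="product-price">$89.90</span></div>
<div><span class="product-name">Wool socks</span>
<span class="product-price">$4.50</span></div>
"""
parser = ProductParser()
parser.feed(html)
```

After `feed`, `parser.products` holds structured records ready to be dropped into the lake alongside your internal data.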

Retail case study: detecting unhappy customers

It’s one of those rules of business that a happy customer may tell a friend about a good experience, but a dissatisfied customer will tell everyone they know! In fact, only around 4 percent of dissatisfied customers will complain directly to you, but on average, they tell between 9 and 15 friends, family, and colleagues about the experience. 

And that’s why “sentiment analysis” has become such a crucial part of many businesses. This is the practice of scouring various social media platforms, such as Twitter or Facebook, for mentions of a product so any issues can be cleared up as quickly as possible and that dissatisfaction turned to brand loyalty. 
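A minimal, lexicon-based sketch of the idea follows. The word lists are invented and far too small for production use, but the negation handling shows the core subtlety: "not bad" should score as positive, not negative.

```python
# Tiny illustrative lexicons -- a real system would use a trained model
# or a full sentiment lexicon.
POSITIVE = {"great", "love", "good", "happy"}
NEGATIVE = {"bad", "broken", "terrible", "hate"}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Positive score = happy mention, negative = route to customer service."""
    score, negate = 0, False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True       # flip the polarity of the next sentiment word
            continue
        if word in POSITIVE:
            score += -1 if negate else 1
        elif word in NEGATIVE:
            score += 1 if negate else -1
        negate = False
    return score
```

Mentions scoring below zero are the ones worth a fast, friendly follow-up, before the customer tells those 9 to 15 friends.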

3. Making the data usable

Trying to nip dissatisfaction in the bud is easier said than done when you’re dealing with mentions on social media. It’s not exactly a bastion of proper grammar and clear, textbook speech. As your reservoirs are filled with information scraped from all corners of the web, it is the job of data scientists to process, clean, and qualify it to make it usable. In the case of sentiment analysis, that means understanding when negative words are used in an overall positive message, or vice versa. But it also means clearing up any inaccuracies or partial or erroneous information that has been collected.

Once we have this clean data, we begin to use various models to enrich it and create inferences that are passed on to the end customers, the decision makers. We may use simple statistical models or neural networks, but the end goal is the same: something that allows us to make market predictions and optimize prospecting.

4. Data visualization for readability

In order to provide these end clients with something that they are able to digest, we call on the interdisciplinary field of data visualization. Simply put, this is a graphic representation summarizing the data through spreadsheets, graphs, or, often, via a dashboard. This field is critically important as we are talking about synthesizing billions of pieces of data in a way that the human mind can process. 

Here is the schematic representation of a Big Data infrastructure deployed by Octopeek in retail.


1) NiFi is a solution that allows data obtained from various sources to be managed. It is capable of interfacing with more than 200 different sources and can process information on the fly. Originally developed by the NSA for large-scale data collection, it was later released as open source.

There are, of course, other means of collecting information, such as Python scripts, and collectively these procedures are referred to as ETL (Extract, Transform, Load). ETL allows information to be collected in any form, transformed into a usable form for your system, and injected into it.
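The three ETL steps can be sketched in a few lines of plain Python. The record format, field names, and the European "12,50" price notation are all invented for illustration; tools like NiFi industrialize exactly this kind of flow at scale.

```python
import json

# EXTRACT: raw records as they arrive from a source (here, JSON text with
# prices in a comma-decimal format our target system cannot use as-is).
raw_records = [
    '{"sku": "A1", "price": "12,50", "qty": "3"}',
    '{"sku": "B2", "price": "8,00", "qty": "1"}',
]

def transform(record):
    """TRANSFORM: normalize types and formats for the target system."""
    data = json.loads(record)
    return {
        "sku": data["sku"],
        "price": float(data["price"].replace(",", ".")),
        "qty": int(data["qty"]),
    }

# LOAD: a stand-in for the real target (data lake, warehouse, database...).
warehouse = []
for record in raw_records:
    warehouse.append(transform(record))
```

Each stage is independent, which is what lets real ETL tools swap sources and targets without rewriting the transformation logic.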

2) The data lake, as we mentioned, is like a warehouse for data, where it can be stored on a cluster of anything from 10 to 20 machines, or even thousands of them. The other important fact about the data lake is that it allows for the information to be distributed and queried. When it comes to Big Data, there are two major references in databases: Hive (SQL-like) and HBase (NoSQL). At Google, they use their own data warehouse service, known as BigQuery.

3) Spark allows data to be held and processed directly in system memory, and has become the standard processing engine for Big Data platforms today.

Concretely, what are the advantages of Big Data in the retail market?

First and foremost, you have to realize that the retail market is extremely competitive. Already, in most cities you’ll find a shop selling products comparable to what you, yourself, sell in your brick-and-mortar store. After all, a business with no competition is probably not filling any market needs. And with the boom in e-commerce, highlighted by the one-time online bookstore turned e-commerce giant, Amazon, you’re no longer just competing with the shop up the street, but with producers all across Europe, the Americas, and Asia.

Big Data isn’t just optional for businesses looking to carve out their corner of the market; it offers an unequaled competitive advantage and, for those who hope to make a splash, it is a necessity. It’s important not to think of it like a solution for retailers; it’s a technological tool that will shift the way you do business.

That said, it’s not something where the advantage is immediately obvious and easily quantified. It’s a complete paradigm shift for businesses.

The easiest way to explain it is through an example.

Say you run a little bakery. At the end of the day, you count how many loaves of bread and pastries you sold and calculate your turnover. This is known as “analytics”, the processing of past data. At the end of some days, you find yourself running out of products, whereas on other days you end up with dozens of unsold loaves. In both cases, you’re losing value, either by not having product to provide would-be paying customers, or by wasting stock.

With Big Data, you have billions of pieces of analytical data from across the country. Using everything from the make-up of your local demographic to weather trends, the date, and the time, we can build models to help optimize production, so that day over day you come closer to producing the exact quantity you need. No products wasted, no unsatisfied customers. And it can also help you optimize the amount of raw material you order, so your shelves are neither bare nor overstocked.
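The bakery example can be sketched with a one-feature least-squares fit. The sales figures and the single "temperature" signal are invented for illustration; a real model would combine many features (day of week, weather, local events) and richer algorithms.

```python
from statistics import mean

# Toy history: on hot days, fewer loaves sell (illustrative data only).
temps  = [12, 15, 18, 21, 24]   # past days: temperature (Celsius)
loaves = [80, 74, 68, 62, 56]   # past days: loaves sold

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(temps, loaves)

def predict(temp):
    """How many loaves to bake, given tomorrow's forecast temperature."""
    return round(slope * temp + intercept)
```

With tomorrow's forecast in hand, the baker bakes to the prediction instead of to yesterday's leftovers: that is the shift from analytics (looking back) to prediction (looking forward).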

In the bakery example, you have a rather limited number of materials stocking your storeroom and products in your display cases. For retailers, there is more variety in the products you sell and order, more bulky packages in your storeroom, and this paradigm shift is all the more important. You’re no longer trying to work from how many units you have in stock, but taking relevant information to predict what you will need tomorrow. And, in the best case, you will even be able to start to think, “What actions can I take that will result in tomorrow occurring as I’d like?”

With Big Data, we turn to the future, rather than the past, and this allows us to:

  • predict fluctuations in raw materials used in our products
  • amass enough data to monitor our competitors’ prices in real time
  • acquire enough information about our clients to predict their buying habits and thus make effective recommendations. 
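The third point, recommendations from buying habits, can be illustrated with a simple co-occurrence count over past baskets. The basket data is invented; real recommender systems use far larger histories and more sophisticated models, but "frequently bought together" is the intuition.

```python
from collections import Counter
from itertools import combinations

# Toy purchase history: each basket is a set of products bought together.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(product):
    """Products most often bought alongside the given one, best first."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common()]
```

A customer with bread in their cart gets butter and jam suggested; one buying milk gets cereal.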

Is Big Data infrastructure within reach of any company?

For medium-sized businesses, we recommend seeking out a SaaS solution. Simply put, you would not need to develop an in-house Big Data infrastructure on your premises. Rather, the management of this infrastructure would be outsourced to a third party, allowing the costs to be shared among clients and lowering the barrier to entry to a manageable sum.

For those businesses that cannot manage such an information system internally, there are cloud-based options available: via Amazon, Google, or Microsoft Azure, for example. However, there are reasons to weigh even these solutions carefully, as the price can run high. It is cheaper to host the service internally, for those who are able, but the cloud does have its advantages as well. With a cloud-based solution, you can simply request another server be made available in a matter of minutes (or seconds), rather than the months it may take to order and set up another physical server.

What are the prerequisites for setting up Big Data infrastructure?

There is a guiding principle behind Big Data, and this is the data driven approach. In order for this to work, there are two prerequisites:

  1. First, you need data. And a lot of it. Many companies believe that the data they have in-house is enough: client information, stock levels, sales histories… The truth is, that’s only the tip of the iceberg. Today, between 80 and 90% of the information used in a data-driven approach is sourced externally. That’s the first hurdle; you need a solution to gather the necessary data.

There are multiple ways that information can be harvested. Today, more and more open source repositories have been made available by governments, which release certain data gathered from public bureaus. A second major source is, of course, the Internet. This is, itself, a second sticking point, because the data available on the web is provided in innumerable different formats.

  2. You must be able to process the gathered data. And this is our second prerequisite. This means that, on top of gathering and storing an incredible amount of information, you must be able to give it a form that fits into your IS. That also means qualifying the quality of the data and being able to verify its validity. The Internet is an amazing repository for human knowledge and data, but not everything available is truthful and complete.

Take, for example, the simple task of monitoring competitors’ prices. If you were to do it manually, simply copying the information into an Excel file, at some point you may run into a pair of shoes with the list price of a pair of socks, or a pair of socks that runs for a million dollars. You’ll have to break this information down and be able to ask, “Is this price within the standard deviation of what can be expected for this product? Are these prices correct?” This type of information and these types of questions are what allow data scientists to give data a reliability score.
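That sanity check can be sketched with a simple median-based rule (the threshold and prices are invented for illustration; a median is used rather than a mean because one million-dollar sock would distort the mean itself).

```python
from statistics import median

def flag_suspect_prices(prices, max_ratio=3.0):
    """Flag prices more than max_ratio times above (or below) the median."""
    m = median(prices)
    return [p for p in prices if p > m * max_ratio or p < m / max_ratio]

# Scraped "shoe" prices: one sock-priced entry and one absurd outlier
# have slipped in among plausible values.
scraped_shoe_prices = [79.9, 85.0, 82.5, 79.0, 88.0, 4.5, 81.0, 1_000_000.0]
suspect = flag_suspect_prices(scraped_shoe_prices)
```

Flagged entries are not necessarily wrong, but they get a low reliability score and a human (or a smarter model) takes a second look before they feed any decision.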

Our economy is built on the backs of small and medium-sized businesses. With the advent of Big Data, we can see that major corporations like GAFA are poised to monopolize the retail market and become the unique interfaces for consumers. It’s critically important that our SMBs are able to compete on equal footing with the same tools and predictive metrics.

Do you have very large volumes of data to centralize, enrich, structure and use? Let’s talk about your project!