Big Data Analysis: Why It’s Necessary and Who Does It?
What is big data analysis?
There’s no strict definition of what constitutes big data, and no volume threshold beyond which ordinary data becomes big. In practice, though, it typically involves at least hundreds of gigabytes and hundreds of thousands of database rows. Big data is also usually updated, modified, and expanded on a regular basis, meaning it’s not only stored but actively collected. Suppose we’ve collected big data and stored it. In this form, it’s just a pile of information that isn’t useful on its own. To make it useful, big data analysis is necessary: structuring the data and processing it with specialized algorithms to draw specific conclusions.
For example, take a hypermarket where people buy specific products. Big data is the information about purchases: which products people buy, how often, and in what quantities. Big data analysis is the study of this information to understand which products should be stocked in larger quantities and which should be discontinued altogether: in other words, to obtain results that help the company develop.
Big data collection and storage
There are many sources of big data for further work. For example:
● Statistics on user behavior on the website and in the app. Which pages they visit, how long they spend choosing a product, and which sections they study most carefully.
● Sales data from cash registers and CRM. What exactly people buy and how much they spend.
● Information from equipment sensors. How the machines in the workshop operate, what temperature is maintained in the room, and what channels a person is watching on a smart TV.
● Social surveys. Data on marital status, age, food preferences, etc.
● Data from medical records. Information about the health status of patients.
● CCTV footage. Age and gender of people, their approximate flow at different times of the day, and routes through the sales area.
● Aggregated information from different databases. We take several databases with “small” data and collect everything in one place, turning the data into big data.
Once collected, the data needs to be stored for subsequent analysis. There are three types of storage locations.
Databases (DBs). These are used to store both small and big data. A database holds clearly structured, organized data, which makes it easier to analyze, but the data must be cleaned and structured before it is stored. This is time-consuming and can lead to the loss of data that seems meaningless now but could prove useful later.
Big data is typically stored using:
● Classic relational databases: MySQL, PostgreSQL, Oracle. They are reliable but hard to scale horizontally, so they are poorly suited to huge, frequently updated data sets.
● Non-relational databases: MongoDB, Redis. They offer weaker consistency guarantees, but are much more flexible and scale more easily.
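To make the “structured storage” idea concrete, here is a minimal sketch using SQLite in place of MySQL or PostgreSQL; the table and column names are invented for the example.

```python
import sqlite3

# Structured storage in a relational database: the schema is defined
# up front, so every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "milk", 1.50), (1, "bread", 0.90), (2, "milk", 1.50)],
)

# Because the data is already structured, analysis is a simple query:
# total revenue per product.
for product, revenue in conn.execute(
    "SELECT product, SUM(amount) FROM purchases GROUP BY product ORDER BY product"
):
    print(product, revenue)
# bread 0.9
# milk 3.0
```

The price of this convenience is exactly what the text describes: every record must be cleaned into the schema before it can be stored.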
A data warehouse is a complex storage system consisting of multiple databases and tools for processing and structuring them. It often also includes services for data analysis and visualization for users.
Greenplum and ClickHouse are often used to build data warehouses.
A data lake is a large repository containing a wealth of raw, unstructured information. Any data can be loaded into it for later extraction, analysis, and business use. While analyzing the data later is more difficult, loading it requires no analysis or structuring.
Hadoop is typically used to build data lakes.
Data lakes are often used in conjunction with data warehouses or databases. All data is first loaded into the lake, then extracted based on specific criteria, structured, and then stored in the warehouse or database.
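The lake-then-warehouse flow above can be sketched in a few lines. The “lake” here is just a list of raw JSON records and the “warehouse” a list of clean tuples; the record types and fields are invented for the illustration.

```python
import json

# The "lake" holds raw, heterogeneous records exactly as they arrived,
# with no structuring or filtering at load time.
lake = [
    '{"type": "sale", "product": "milk", "amount": 1.5}',
    '{"type": "page_view", "page": "/catalog"}',
    '{"type": "sale", "product": "bread", "amount": 0.9}',
]

# Later, records matching a criterion are extracted, structured,
# and loaded into the "warehouse".
warehouse = []
for raw in lake:
    record = json.loads(raw)
    if record["type"] == "sale":
        warehouse.append((record["product"], record["amount"]))

print(warehouse)  # [('milk', 1.5), ('bread', 0.9)]
```

Note that the page-view record stays in the lake untouched: it cost nothing to keep, and it may become useful later.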

Big data analysis and use technologies
The main goal of big data analysis is to help businesses make informed decisions and automate certain processes. To achieve this, there are various methods for using and working with big data.
Data blending and integration. Big data is often collected from many different sources. However, it’s not always possible to merge it into a single database: the data is often heterogeneous and difficult to consolidate.
In this case, integration technology is used, which involves both processing and analyzing the data. All disparate information is brought into a unified format, then supplemented and verified: redundant data is removed, and missing data is imported from other sources. Often, useful conclusions can already be drawn from the data at this stage.
Traditionally, data integration relies on ETL processes: extraction, transformation, and loading. Dedicated ETL systems are built around these three steps.
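The three ETL steps can be sketched end to end. This toy example assumes sales arrive as a semicolon-separated CSV export with comma decimal separators; the file contents, column names, and target table are all invented.

```python
import csv
import io
import sqlite3

# Raw export: semicolon-delimited, comma as decimal separator,
# with one duplicate row.
raw_csv = "customer;amount\nAnna;120,50\nBoris;80,00\nAnna;120,50\n"

# Extract: read the raw rows.
rows = list(csv.DictReader(io.StringIO(raw_csv), delimiter=";"))

# Transform: unify the number format and drop exact duplicates.
seen, clean = set(), []
for row in rows:
    record = (row["customer"], float(row["amount"].replace(",", ".")))
    if record not in seen:
        seen.add(record)
        clean.append(record)

# Load: write the cleaned records into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

print(clean)  # [('Anna', 120.5), ('Boris', 80.0)]
```

Real ETL systems handle far more formats and failure modes, but the shape of the pipeline is the same.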
Statistical analysis. This means computing aggregate indicators over the data according to specific criteria, such as shares, averages, and correlations. Statistics work especially well with big data: the larger the sample, the more reliable the result.
When analyzing big data, one can consider:
● Simple percentages, such as the share of loyal customers.
● Average values of data from different groups, for example, the average bill for different categories of customers.
● Correlation, to calculate how changes in one data point affect others. For example, how a customer’s age affects their purchasing power.
Other indicators can be calculated as well, depending on business needs.
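The three indicators listed above can be computed with a few lines of standard Python. The customer records below are invented for the sketch.

```python
from statistics import mean

# Toy customer records: group, age, and average bill are all made up.
customers = [
    {"group": "loyal", "age": 25, "bill": 40},
    {"group": "loyal", "age": 35, "bill": 60},
    {"group": "new",   "age": 22, "bill": 20},
    {"group": "new",   "age": 50, "bill": 55},
]

# 1. Simple percentage: share of loyal customers.
loyal_share = sum(c["group"] == "loyal" for c in customers) / len(customers)

# 2. Average bill per customer group.
avg_bill = {
    g: mean(c["bill"] for c in customers if c["group"] == g)
    for g in {"loyal", "new"}
}

# 3. Pearson correlation between age and bill.
ages = [c["age"] for c in customers]
bills = [c["bill"] for c in customers]
ma, mb = mean(ages), mean(bills)
cov = sum((a - ma) * (b - mb) for a, b in zip(ages, bills))
corr = cov / (sum((a - ma) ** 2 for a in ages) ** 0.5
              * sum((b - mb) ** 2 for b in bills) ** 0.5)

print(loyal_share)       # 0.5
print(avg_bill)
print(round(corr, 2))    # 0.76
```

On real big data the same calculations would run inside the database or an analytics engine rather than in a Python loop, but the indicators themselves are identical.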
Machine learning and neural networks. Big data can be used to create automated systems capable of making decisions independently. In their simplest form, these are chatbots that recognize user responses; in their most complex form, large distributed procurement or production management systems. For such systems to work, they need established behavioral patterns, and these patterns are extracted precisely by working with big data: the system looks at how the data changed in the past and acts on that in the present. Neural networks are one of the main technologies used to build such systems.
Through training, neural networks can be taught to analyze big data. For example, a neural network can be fed thousands of photographs of women and men. It can then learn to identify gender from photos or videos, which in turn makes it possible to segment consumer behavior.
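The idea of extracting a pattern from past examples can be shown with the smallest possible learner: a single perceptron neuron, far simpler than a real neural network. The features (store visits per month, promo clicks) and labels (1 = bought again) are invented for the sketch.

```python
# Past examples: ((visits per month, promo clicks), bought again?)
past = [
    ((8, 5), 1), ((6, 4), 1), ((7, 6), 1),   # customers who returned
    ((1, 1), 0), ((2, 0), 0), ((0, 1), 0),   # customers who did not
]

w, b = [0, 0], 0

def predict(x):
    # Fire (predict 1) if the weighted sum of features is positive.
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Perceptron rule: nudge the weights whenever a past example
# is misclassified, repeating over the data several times.
for _ in range(10):
    for x, label in past:
        error = label - predict(x)
        w[0] += error * x[0]
        w[1] += error * x[1]
        b += error

# The learned pattern is then applied to new customers in the present.
print(predict((5, 3)))  # 1 - behaves like returning customers
print(predict((1, 0)))  # 0
```

Real neural networks stack many such units and train them on far larger data, but the principle is the same: weights adjusted on past examples encode a pattern that is then applied to new cases.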

Predictive analytics. This is the creation of forecasts based on data. For example, we look at customer behavior over the past year and can predict the demand for specific products on a given day. Or we can determine which specific parameters influence customer behavior.
Predictive analytics is used to forecast currency fluctuations, consumer behavior, delivery times in logistics, and company financial performance.
Predictive analytics involves carefully examining big data, then calculating correlations and plotting graphs to predict how a situation will unfold in the future.
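As a minimal illustration of the calculate-correlations-and-extrapolate idea, here is a linear trend fitted to past monthly demand with least squares and extended one month ahead. The demand figures are invented; real predictive analytics uses far richer models.

```python
# Past observations: month number and units sold (invented data).
months = [1, 2, 3, 4, 5, 6]
demand = [100, 110, 119, 131, 140, 150]

# Least-squares fit of a straight line demand = slope * month + intercept.
n = len(months)
mx = sum(months) / n
my = sum(demand) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(months, demand))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx

# Extrapolate the trend to the next month.
forecast = slope * 7 + intercept
print(round(forecast))  # 160
```

The same mechanics, scaled up with many more variables and nonlinear models, underlie forecasts of demand, delivery times, and financial performance.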
Simulation modeling. Predictive analytics helps predict what will happen if nothing changes and the system continues to operate using the same data. Simulation helps answer the question, “What if?” To do this, we build a highly accurate model of the situation based on big data and then change its parameters: increase the price of a product, increase customer traffic, or change the size of a machined part. The model responds and predicts future outcomes: how profits will change, what will happen to customer loyalty, or whether production speed will decrease.
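A toy “What if?” model makes the contrast with plain forecasting visible: instead of extrapolating the past, we change a parameter and watch the modeled outcome respond. The assumption that demand falls linearly as price rises, and every coefficient below, are invented for the sketch.

```python
def profit(price, base_demand=1000, sensitivity=40, unit_cost=5):
    """Toy model: demand shrinks linearly as price rises above 10."""
    demand = max(0, base_demand - sensitivity * (price - 10))
    return (price - unit_cost) * demand

# Vary one parameter of the model and observe the predicted outcome.
for price in (10, 15, 20, 25):
    print(price, profit(price))
# 10 5000
# 15 8000
# 20 9000
# 25 8000
```

Even this crude model answers a what-if question: raising the price helps up to a point (here, 20), after which lost customers outweigh the higher margin. Real simulation models are built from big data rather than guessed coefficients, but they are interrogated in exactly this way.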
Big data analysis tools

Power BI and Tableau are used to visualize data analysis results. They allow you to create visual charts, graphs, and tables to present analytical results to those less knowledgeable in data analysis.
There are also specialized tools and frameworks for processing big data using various technologies, such as Hadoop, Caffe, and others. These are used for machine learning and complex data analysis, with the choice of tool depending on the company’s technologies and business objectives.
Data analytics jobs
At a more advanced level, data scientists work more with big data analysis methods such as models, neural networks, and visualization, while analysts rely on statistical analysis and other mathematical methods. This isn’t a strict rule, though: analysts often work with visualization, and data scientists with statistics. It all depends on the tasks the business defines.

Data engineer. This person builds the very systems that data analysts and data scientists use. They deploy warehouses, configure data cleaning and analysis systems, provide data to analysts on request, and make sure everything runs smoothly.
Data analyst. This is the person who analyzes big data. Businesses come to them with questions like: “Which product should we eliminate from our product range?”, “What determines the average hospital appointment time?”, “Which customers buy the most?” The analyst takes the collected data, analyzes it using specialized technologies, and produces a report. Managers and executives then make business decisions based on that report.