What is Big Data and how does it work?

What is Big Data?
Big Data Parameters
How Big Data Works: Collection, Storage, and Processing
1. Collection
2. Storage
3. Processing
4. Analysis
Benefits of Big Data Technology
Where is big data analytics applied?
The Use of Big Data Around the World
Who works with Big Data?
How to get started with big data
Expert advice

Big Data is like an encrypted message for business. Deciphering it provides precise guidance that helps you chart the right development course. Let’s explore how exactly this works.

What is Big Data?

The term “big data” first gained currency in the mid-2000s, when the volume of information used for calculations and analysis began to grow rapidly. The journal Nature then proposed that big data be defined as data exceeding 150 GB per day. Other experts consider big data to include any volume exceeding 8 GB, the size of standard computer RAM.

However, it’s easier to understand big data through its application. Big data is a set of information that helps make informed decisions; this approach is called data-driven. It enables the creation of highly accurate predictive models, which, for example, are used at NASA for detailed planning of future missions. Collecting, storing, and analyzing such volumes of data requires specialized tools and equipment. Instead of a single computer or server, multiple servers or data centers distributed around the world are used.

Big Data Parameters

The main characteristics of Big Data are defined as six “V”s:

● Volume. From 150 GB per day.

● Velocity. The volume and content of Big Data change every second, so collecting and processing it requires high-powered computing. One example is the FlightRadar24 service, which displays flights online in real time.

● Variety. Big data can include photos, videos, and text, files of different sizes and formats, and data from many different sources. Regular data is usually homogeneous, for example, an Excel spreadsheet with the full name of each employee.

● Veracity. For big data to be useful, it must come from trusted sources and be analyzed with accurate, objective methods. Only then can corporations and international organizations base strategic decisions on it.

● Variability. Big data is updated online, so its flow is unstable. It is affected by transmission speed, changing sources, user actions, and even seasonal changes. All these factors must be taken into account and forecasted when analyzing data. For example, air travel data should be used with adjustments for flight delays and weather conditions, which can cause route changes.

● Value. Data alone means nothing, but it can be used to draw insightful conclusions and make informed decisions. For example, analyzing traffic patterns over the course of a year shows how best to design public transportation routes.

Big Data also comes in the following forms:

● Structured—that is, already labeled according to specific parameters. For example, data on key patient medical indicators: temperature, blood pressure, blood tests, and ECG.

● Semi-structured, such as files of different formats with records of natural disasters in the region over the past five years.

● Unstructured, such as photos, music, and messages from all social platform users.
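As a rough illustration of these three forms, the Python sketch below shows one sample of each. All field names and values are hypothetical and are not taken from any real system.

```python
import json

# Structured: every record has the same fixed, labeled fields.
patient_row = {"patient_id": 17, "temperature_c": 36.6, "systolic": 120, "diastolic": 80}

# Semi-structured: records share a general shape, but their fields vary.
disaster_records = [
    json.loads('{"type": "flood", "year": 2021, "affected_area_km2": 340}'),
    json.loads('{"type": "wildfire", "year": 2023, "notes": "evacuation ordered"}'),
]

# Unstructured: raw content with no labeled fields at all.
message = "Stuck at the airport again, flight delayed by three hours"


def common_fields(records):
    """Return the field names that every record in the list contains."""
    return set.intersection(*(set(r) for r in records))


shared = common_fields(disaster_records)
```

Only the fields in `shared` can be queried uniformly across the semi-structured records; everything else needs extra handling before analysis.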

Another key difference between big data and regular data is its distributed structure. This means it can be managed and analyzed using multiple microservices.

Visually, this can be represented as a funnel: a lot of data is poured in at the input, and a narrow segment of the required information is received at the output.

When uploading regular data, there’s only one source, and the process is quick and easy. But when it comes to big data, there are multiple sources, the volume of information is large, and a typical funnel may be overwhelmed. Specialized services help speed up this process: they quickly process big data and present it as structured segments of information.

How Big Data Works: Collection, Storage, and Processing

Technologies used for big data:

1. Collection

Big data is collected from various sources:

● Social – everything that users post and do on social networks, online services, and apps. This includes photos, videos, audio, instant messaging, geolocations, and hashtags.

● Statistical – all data from government agencies and research companies about people, animals, vehicles, goods and services, political and economic phenomena.

● Medical – data from electronic medical records: clinical indicators, tests, hardware diagnostics, vaccinations, and medical histories.

● Machine – recordings from surveillance cameras, video recorders, control systems, and smart devices.

● Transactional – data on payments and transfers through banks and other financial services.

During collection, the data also goes through data cleaning. At this stage, specialized software is used to find, select, and filter data, checking it for accuracy and compliance with specified parameters. Data cleaning specialists tag data arrays to make it easier for algorithms to find the necessary information segments in response to user queries.
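The cleaning step can be sketched in a few lines of Python. This is a minimal illustration only: the sensor records, the valid temperature range, and the tag names are all assumptions made for the example.

```python
def clean_and_tag(records):
    """Drop invalid readings and tag the valid ones (hypothetical rules)."""
    cleaned = []
    seen = set()
    for rec in records:
        key = (rec.get("sensor_id"), rec.get("timestamp"))
        temp = rec.get("temperature_c")
        # Filter: skip duplicates, missing values, and implausible readings.
        if key in seen or temp is None or not -90 <= temp <= 60:
            continue
        seen.add(key)
        # Tag: add a label that later queries can search by.
        tag = "frost" if temp < 0 else "normal"
        cleaned.append({**rec, "tag": tag})
    return cleaned


raw = [
    {"sensor_id": "a1", "timestamp": "2024-01-05T10:00", "temperature_c": -4.0},
    {"sensor_id": "a1", "timestamp": "2024-01-05T10:00", "temperature_c": -4.0},   # duplicate
    {"sensor_id": "b2", "timestamp": "2024-01-05T10:00", "temperature_c": None},   # missing value
    {"sensor_id": "c3", "timestamp": "2024-01-05T10:00", "temperature_c": 999.0},  # out of range
    {"sensor_id": "d4", "timestamp": "2024-01-05T10:00", "temperature_c": 12.5},
]
result = clean_and_tag(raw)
```

Of the five raw records, only the first and the last survive, each with a tag attached.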

Data is extracted either in full or incrementally, picking up from the last successful extraction. For example, when generating a weather forecast, only data from the last 24 hours is used. To extract Big Data for a specific period, the storage provides a special data capture function.
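One common way to pick up from the last successful extraction is to remember a timestamp (a “watermark”) and request only newer rows. The sketch below assumes each row carries an `updated_at` field; the function name and the data are hypothetical.

```python
from datetime import datetime


def extract_since(rows, last_extracted_at):
    """Return rows newer than the watermark, plus the updated watermark."""
    new_rows = [r for r in rows if r["updated_at"] > last_extracted_at]
    new_watermark = max(
        (r["updated_at"] for r in new_rows), default=last_extracted_at
    )
    return new_rows, new_watermark


source = [
    {"id": 1, "updated_at": datetime(2024, 3, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 3, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 3, 3, 9, 0)},
]
watermark = datetime(2024, 3, 1, 12, 0)  # time of the last successful extraction
batch, watermark = extract_since(source, watermark)

# Running the extraction again with no new data returns an empty batch.
second_batch, second_watermark = extract_since(source, watermark)
```

A full extraction is the same call with a watermark older than every row.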

2. Storage

Regular data fits on a single computer or online drive. This isn’t possible with big data, so it’s stored and processed using cloud servers and distributed computing power. This allows multiple people to work with big data simultaneously, accessing it from different locations.

The following are used to store big data:

● A DWH, or data warehouse, is a single repository for all data used by a company to generate reports and make decisions. Files are grouped by application area and arranged chronologically. For example, an online store’s DWH contains data on all clients, transactions, and departments. Data is received using the ETL (Extract, Transform, Load) principle: first extracted, then transformed, and finally loaded in a single format.

● Data lakes are repositories that don’t have a single format or clear structure. Here the order of steps changes (ELT): extraction, loading into a database, and only then transformation into a format suitable for current tasks. A data lake resembles a virtual disk where text, photos, and PDFs are stored, while a database is a table where all of this data is listed.

● DBMSs (database management systems) are either relational or non-relational. Relational databases are more commonly used for working with big data, as they organize data into tables linked by keys and use a specialized query language, SQL. For example, in the ClickHouse column-oriented DBMS used by Yandex Metrica, all website traffic data is stored in several tables. Rows represent events, such as page views, and columns represent parameters, such as mobile device traffic. This allows you to generate a website traffic report in seconds, based on several parameters at once, since the values of each column are stored next to each other.

Many companies use non-relational DBMSs. Instead of transforming data into related tables, these systems store it in a different model, such as key-value pairs or documents. This allows for quick insertion and retrieval of required information from storage, as well as the launch of high-load applications. For example, Amazon’s DynamoDB is a serverless NoSQL DBMS that supports key-value and document data models. It is used in smart home systems, online games, and advertising services.
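The ETL principle described above for data warehouses can be sketched with Python’s built-in sqlite3 module standing in for the warehouse. The source records, the table name, and the formatting rules are assumptions made for illustration.

```python
import sqlite3


def extract():
    # Hypothetical raw orders as they might arrive from two source systems.
    return [
        {"client": " Anna ", "amount": "19.90", "currency": "usd"},
        {"client": "BORIS", "amount": "5", "currency": "USD"},
    ]


def transform(rows):
    # Bring every record to a single format before loading.
    return [
        (r["client"].strip().title(), float(r["amount"]), r["currency"].upper())
        for r in rows
    ]


def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (client TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database used as a stand-in warehouse
load(conn, transform(extract()))
stored = conn.execute(
    "SELECT client, amount, currency FROM orders ORDER BY client"
).fetchall()
```

In a data lake the last two steps would swap places: the raw records would be loaded as they are and transformed later, when a specific task requires it.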

3. Processing

Processing large volumes of data using conventional tools would be difficult and time-consuming. For these tasks, specialized software based on MapReduce technology is used. First, the algorithm selects data based on specified parameters and distributes it among separate nodes (servers or computers), which process their data segments in parallel (the map step). The partial results are then combined into the final answer (the reduce step).
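The idea can be shown with a classic word count. The sketch below is a single-machine illustration in which threads stand in for separate nodes; it is not how Hadoop itself is implemented, and the function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map: emit a (word, 1) pair for every word in one chunk of the data."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Each inner list is the data segment assigned to one "node".
chunks = [
    ["big data big plans"],
    ["data beats opinions", "big wins"],
]
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_chunk, chunks))  # chunks are processed in parallel
word_counts = reduce_groups(shuffle(mapped))
```

Because each chunk is mapped independently, adding more nodes mainly means splitting the input into more chunks.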

Here are examples of tools that follow this distributed approach:

● Hadoop is an open-source framework that allows multiple specialists to collect, store, and work with Big Data simultaneously. It automatically redistributes the workload so that if one node fails, others can continue working in its place.

● Apache Spark is an engine with several libraries for working with data, including streaming data that is updated at high speed. The data can be filtered, processed, and used for machine learning.

To monitor the quality of data collection, structure the data, and find the information they need, some companies hire specialists called DWH analysts.

4. Analysis

To apply big data in your work, you need to analyze it across a wide range of parameters. This can be accomplished with:

● SQL, a query language used when working with relational DBMSs.

● Neural networks, which are trained using machine learning to process large volumes of information quickly and support complex tasks.
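As a small illustration of the SQL approach, the sketch below counts page views by page and device. It uses Python’s built-in sqlite3 module as a stand-in for a real DBMS; the table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, device TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", "mobile"), ("/home", "desktop"), ("/cart", "mobile"), ("/home", "mobile")],
)

# Count views per page and device, most viewed first.
report = conn.execute(
    """
    SELECT page, device, COUNT(*) AS views
    FROM page_views
    GROUP BY page, device
    ORDER BY views DESC, page, device
    """
).fetchall()
```

The same `GROUP BY` pattern scales from this toy table to reports over billions of events; what changes is the engine that executes it.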

To extract the necessary segments of information and transform them into understandable reports and graphs, specialized analytical services based on Business Intelligence (BI) are used. For example, Microsoft’s Power BI is a business analytics service that collects data from CRM systems, Excel spreadsheets, and other sources, and then presents it as interactive reports.

The “Data Science Specialist” course teaches students how to work with Big Data using the most popular tools and technologies. And most importantly, how to extract the most valuable information from raw data and make decisions that will benefit the business.

Benefits of Big Data Technology

Big data is driving the global economy. It helps:

● Work with large volumes of information.
For example, the VK database covers millions of users, each of whom has hundreds of messages, photos, music tracks, and videos.

● Make more accurate forecasts and more informed decisions.
For example, you can plan an advertising campaign based on Yandex Metrica data on millions of users and each user’s digital footprint: browser, settings, and websites visited. This makes it possible, for example, to show auto parts ads only to those who visit automotive websites.

● Instantly respond to failures and vulnerabilities.
With access to big data on all user activity, banks and payment services can immediately track suspicious activity and stop fraudsters.

● Develop long-term strategies.
If a company has data on sales, profits, and losses for several years, analyzing them will help plan investments, personnel management, and product range.

● Fix bugs and improve the product.
Let’s say delivery service specialists notice that in the evening, people often cancel their orders 30 minutes after placing them. This suggests that customers don’t want to wait long and prefer to go to the store themselves. This problem can be solved by increasing the number of couriers during these hours.
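A pattern like the one in the delivery example can be found by computing the share of cancelled orders for each hour. This is a minimal sketch on made-up data; the field names are assumptions.

```python
from collections import Counter


def cancellation_rate_by_hour(orders):
    """Return {hour: share of orders placed in that hour that were cancelled}."""
    placed = Counter(o["hour"] for o in orders)
    cancelled = Counter(o["hour"] for o in orders if o["cancelled"])
    return {hour: cancelled[hour] / placed[hour] for hour in placed}


orders = [
    {"hour": 12, "cancelled": False},
    {"hour": 12, "cancelled": False},
    {"hour": 19, "cancelled": True},
    {"hour": 19, "cancelled": True},
    {"hour": 19, "cancelled": False},
    {"hour": 19, "cancelled": True},
]
rates = cancellation_rate_by_hour(orders)
peak_hour = max(rates, key=rates.get)  # hour with the highest cancellation rate
```

Here the evening hour stands out, which is the signal that more couriers may be needed at that time.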

But there are also disadvantages:

● Difficulties with scaling.
For example, a startup’s service has 10,000 users. After an advertising campaign, 1 million new users join. Not every data storage and processing system can handle such a sudden influx. Specialized cloud storage solutions that can be scaled at any time help solve this problem.

● High risks.
Big data increases security requirements. For example, if a major bank’s database is hacked, millions of clients could lose money. To prevent this, companies that own big data use distributed access: different groups of employees have different access levels and can reach only specific database segments. Furthermore, data is encrypted and structured at every level.

● High costs.
Big data requires significant computing power and more expensive storage and processing services. For example, training neural networks requires huge datasets, which are only available to very large corporations and are often unavailable for public use. Working with big data requires the involvement of specialists: data analysts, data warehouse analysts, and BI specialists.
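The distributed access mentioned under “High risks” above can be sketched as a simple mapping from roles to permitted database segments. The roles and segment names below are hypothetical, and real systems add authentication, auditing, and encryption on top.

```python
# Hypothetical mapping of employee roles to the database segments they may read.
ROLE_SEGMENTS = {
    "support": {"contacts"},
    "analyst": {"transactions_anonymized"},
    "security": {"contacts", "transactions", "audit_log"},
}


def can_read(role, segment):
    """Allow access only if the segment is explicitly granted to the role."""
    return segment in ROLE_SEGMENTS.get(role, set())


support_sees_contacts = can_read("support", "contacts")
support_sees_transactions = can_read("support", "transactions")
unknown_role_sees_contacts = can_read("intern", "contacts")
```

Denying by default (an unknown role gets an empty set) limits the damage if a single account is compromised.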

Where is big data analytics applied?

Big data is finding applications in business, the public sector, and industry. Here are the most common areas where big data is used:

● Transport. Using big data on routes and vehicle speeds, navigation apps offer the quickest route, taking traffic jams into account.

● Mobile communications and the internet. Mobile operators use big data to predict network loads in each zone and understand where there are not enough conventional cell towers and where 5G is needed.

● Medicine. Big data can be used to predict how an epidemic will develop and at what point it will be necessary to increase vaccine production.

● Marketing. For example, Amazon uses a product recommendation system that is trained on big data and reportedly generates up to 35% of the service’s revenue.

● HR. Based on millions of profiles of successful candidates, an algorithm can be created that will automatically select the most suitable applicants and send them invitations for interviews.

● Manufacturing. Big data helps organize employee work in a way that reduces the risk of accidents and incidents on the shop floor.

● Fintech. Based on data on all fraud cases, banks can create the most secure services for online payments.

● Education. Large data sets help create personalized recommendations of universities for applicants and professions for graduates.

● Internet of Things. Big data is used to make smart devices, sensors, surveillance cameras, and driverless car control systems work even better and make people’s lives easier.

● Science. Research results, surveys, and instrument readings help identify non-obvious patterns and make discoveries in various fields of science.

● Public administration. Big data in the form of statistics helps government agencies and international organizations better allocate resources and respond to problems that are truly relevant to people.

● Artificial intelligence and robots. Using datasets of real dialogues, companies train voice assistants and chatbots that replace technical support or call center staff.

The Use of Big Data Around the World

Big data use is increasingly regulated worldwide. While this effectively protects the data of users of various online services, it also significantly limits business opportunities. For example, the European General Data Protection Regulation (GDPR) restricts how services such as Google may collect data on family composition or nationality, data that would otherwise help small businesses save on advertising budgets.

In the US, the main consumers and holders of Big Data are large corporations: Apple, Google, Facebook, and Amazon. However, the government is increasingly restricting their activities and their monopoly on data collection and storage. In some states, such as California, any user can request all data stored in connection with their profile from online services and demand that it be deleted.

China has over 200 different laws and regulations designed to protect personal information, including information collected automatically by smartphone apps. Because of this, most foreign social media and international online services are blocked in the country, while the data itself remains accessible to government agencies.

Big Data is divided into personal and anonymized data. Personal data can be used to clearly identify a user: for example, a first and last name, home address, or mobile phone number. Anonymized data includes everything else, such as geolocation or shopping lists that are not linked to a specific person or phone number.

The difficulty is that collecting anonymized data from various sources and comparing it can also yield fairly comprehensive information. For example, taxi and delivery service data can help us understand where people live and work, their approximate income, and their shopping preferences. This is how the world’s leading big data holders make their money.
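To see why combined anonymized data can be revealing, consider the sketch below. It guesses home and work zones from taxi trips tied only to an anonymous rider ID. The trips are made up, and the heuristic (morning trips start at home and end at work) is an assumption chosen for illustration.

```python
from collections import Counter


def likely_home_and_work(trips):
    """Guess home and work zones from the anonymized trips of one rider ID."""
    # Assumption: morning trips usually start at home and end at work.
    morning = [t for t in trips if 6 <= t["hour"] < 11]
    home = Counter(t["from_zone"] for t in morning).most_common(1)
    work = Counter(t["to_zone"] for t in morning).most_common(1)
    return (home[0][0] if home else None, work[0][0] if work else None)


trips = [
    {"hour": 8, "from_zone": "Riverside", "to_zone": "Downtown"},
    {"hour": 9, "from_zone": "Riverside", "to_zone": "Downtown"},
    {"hour": 19, "from_zone": "Downtown", "to_zone": "Riverside"},
    {"hour": 8, "from_zone": "Riverside", "to_zone": "Airport"},
]
home_zone, work_zone = likely_home_and_work(trips)
```

No name or phone number appears anywhere in the input, yet the output already narrows down where the rider lives and works.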

Who works with Big Data?

Specialists whose work is related to Big Data can be divided into three large groups:

● Infrastructure professions: employees of cloud providers who provide technical data collection and storage, data engineers, and developers of data processing centers.

● Analytical professions: systems, web, and data analysts, and marketers. Their job is to process big data to make the service more user-friendly. For example, Big Data is what allows YouTube’s recommendation service to adapt to each user and suggest relevant videos.

● AI and machine learning professions: specialists who use Big Data to train neural networks and create automated services based on them.


How to get started with big data

1. Explore open datasets on various topics, a list of which can be found on Habr. Even a small 10 TB storage space will help you understand the logic behind structuring and working with data.

2. Learn query and programming languages. SQL and programming languages like Python, R, or Scala are needed to write database queries, structure data, and turn the necessary information into visual reports and dashboards.

3. Take courses to gain a more thorough and structured foundation. For example, the “Data Science Specialist” course explains how to use big data technologies for various tasks in science and business. To achieve this, students learn mathematical methods, logical approaches, tools for sorting and visualizing data, and the fundamentals of machine learning. The course is suitable for both specialists with a technical background and those with no IT experience.

Expert advice

Moses Gaspar:
It’s best to begin learning Big Data with data manipulation techniques. Whether you’re working with a table with a million or a thousand rows, the analysis methods remain the same; only the tools may change slightly due to the volume. Understanding how to filter, aggregate, explore, and cleanse data is a foundation that will be useful in any Big Data work.
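The basic techniques named above (cleansing, filtering, and aggregating) can be practiced on a handful of rows before moving to larger tools. The sketch below uses only the Python standard library, and the purchase data is made up.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"city": "Lisbon", "amount": 40.0},
    {"city": "Lisbon", "amount": 60.0},
    {"city": "Porto", "amount": None},  # missing value to cleanse
    {"city": "Porto", "amount": 30.0},
]

# Cleanse: drop rows with missing amounts.
clean = [r for r in rows if r["amount"] is not None]

# Filter: keep only purchases of at least 35.
large = [r for r in clean if r["amount"] >= 35]

# Aggregate: average amount per city.
by_city = defaultdict(list)
for r in clean:
    by_city[r["city"]].append(r["amount"])
avg_by_city = {city: mean(amounts) for city, amounts in by_city.items()}
```

The same three steps apply to a table with a million rows; only the tool that runs them changes.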