In a previous article, we looked at what big data is, some of its properties, and the benefits of collecting and using your company’s data to create value in your business and to provide insights that support your strategic decision-making. In this article, we are going to look at the stages involved in building a typical big data pipeline.
The defining properties of big data (variety, velocity and volume) have made big data pipelines increasingly popular, because a pipeline lets you generate predictions and insights quickly and effectively. Big data pipelines are built to support one or more of these characteristics, streamlining the process of transforming your data into a form that can create value for your business. Your pipeline needs to be scalable, so that it can handle varying volumes of data arriving at varying rates, and it needs to work with data in a variety of formats, both structured and unstructured.
In the image above, we can see a typical big data pipeline that many organizations would use as a foundation to start using their big data. The first step in this pipeline is collecting your data. Big data is often stored in the form of a data lake. A data lake is a large repository that can be used to store raw data that will be processed when needed. Data lakes support both structured and unstructured data.
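To make the idea of a data lake more concrete, here is a minimal sketch that uses a local folder as a stand-in for the lake. The folder layout (raw/<source>/<date>/) and the function names land_raw and list_raw are assumptions made for illustration; a real data lake would normally sit on distributed or cloud object storage. The point it shows is that structured and unstructured files are stored side by side, unchanged, until they are needed.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, day: date, name: str, payload: bytes) -> Path:
    """Store a raw payload, unchanged, under <root>/raw/<source>/<YYYY-MM-DD>/."""
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing or cleaning happens at this stage
    return target


def list_raw(lake_root: Path, source: str) -> list[str]:
    """Return the lake-relative paths of every raw file stored for one source."""
    base = lake_root / "raw" / source
    return sorted(
        p.relative_to(lake_root).as_posix() for p in base.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        today = date(2024, 1, 2)
        # Structured (JSON) and unstructured (free text) data land in the same place.
        land_raw(lake, "orders", today, "orders.json", b'[{"id": 1, "total": 9.5}]')
        land_raw(lake, "orders", today, "support_note.txt", b"Customer asked about delivery.")
        for path in list_raw(lake, "orders"):
            print(path)
```

Partitioning by source and date is only one possible convention, but it tends to make the later ingestion step simpler, because a job can pick up just the folders it has not processed yet.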
The next stage is data ingestion, in which the data from the data lake is transported to a storage medium where it can be prepared. In the preparation stage, otherwise known as the pre-processing stage, the raw data is transformed into a usable format using data mining and pre-processing tools. This allows the data to be processed at a later stage, fed into models, or analyzed directly. Common data preparation techniques include normalization, imputation of missing values, data type conversion and resampling.
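As a rough illustration of the preparation stage, the sketch below applies three of the techniques mentioned above (type conversion, mean imputation and min-max normalization) to a small list of raw string values. The function names and the sample values are made up for this example, and mean imputation is only one of several possible imputation strategies; in practice you would more likely use a data preparation library or service.

```python
from statistics import mean
from typing import Optional


def to_float(value: str) -> Optional[float]:
    """Type conversion: parse a raw string, returning None if it is blank or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def impute_mean(values: list[Optional[float]]) -> list[float]:
    """Imputation: replace missing entries (None) with the mean of the present ones."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values: list[float]) -> list[float]:
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical, so no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = ["10", "", "30", "n/a", "20"]      # hypothetical raw column
    typed = [to_float(v) for v in raw]       # [10.0, None, 30.0, None, 20.0]
    filled = impute_mean(typed)              # missing values become 20.0
    print(min_max_normalize(filled))         # [0.0, 0.5, 1.0, 0.5, 0.5]
```

The order matters here: values are converted first so that missing entries can be detected, imputed second, and normalized last so that the filled-in values are scaled along with the rest.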
The next stage is the computation or data processing stage. Data processing involves extracting useful information from the prepared data so that data scientists and analysts can use it. This can involve feeding your data into an algorithm, a machine learning model or a predictive model to generate insights. Data processing can be compute-intensive, so frameworks like Hadoop or Spark are used to distribute the processing workload across multiple servers and make sure that no single server is overloaded. Once your data has been processed, the data science and advanced analytics steps come in, allowing you to extract meaningful insights using machine learning, predictive modelling, statistical analysis or data mining.
The last stage is data presentation. This involves creating dashboards, graphs or reports that document the insights gathered in the data processing stage, so that non-technical stakeholders can use those insights to make business decisions and create value.
Now that we have taken a look at a typical big data pipeline and the different stages involved, next we can look at how these data pipelines are built on GCP, AWS and Azure, and at the different services used for big data in the cloud.
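The idea of distributing a workload can be sketched on a single machine. The example below is not Hadoop or Spark code; it only imitates the split, process-in-parallel and combine pattern those frameworks apply across servers, using threads from Python's standard library as stand-ins for the servers. The record layout (a dictionary with an "event" key) and the function names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records: list, n_chunks: int) -> list[list]:
    """Split the records into at most n_chunks pieces of roughly equal size."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_events(chunk: list[dict]) -> Counter:
    """The per-worker step: count event types within one chunk only."""
    return Counter(record["event"] for record in chunk)


def distributed_count(records: list[dict], workers: int = 4) -> Counter:
    """Process the chunks in parallel, then combine the partial results."""
    chunks = split_into_chunks(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_events, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge each worker's counts into the final result
    return total


if __name__ == "__main__":
    # Hypothetical prepared records from the earlier stages.
    events = [{"event": "click"}] * 5 + [{"event": "view"}] * 3
    print(distributed_count(events, workers=3))
```

Because each chunk is counted independently and the partial counts are simply added together, the same logic keeps working when the chunks live on different machines, which is what makes this kind of workload a good fit for a distributed framework.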
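Dashboards are usually built with dedicated visualization tools, but the core idea of the presentation stage, turning processed numbers into something a stakeholder can read at a glance, can be shown with a plain-text report. The function name render_report, the metric names and the figures below are invented for this sketch, which also assumes that all values are non-negative.

```python
def render_report(title: str, metrics: dict[str, float], width: int = 20) -> str:
    """Render metrics as a text bar chart, largest value first."""
    peak = max(metrics.values(), default=0)
    lines = [title, "=" * len(title)]
    ranked = sorted(metrics.items(), key=lambda item: item[1], reverse=True)
    for name, value in ranked:
        # Scale each bar relative to the largest value in the report.
        bar_length = round(width * value / peak) if peak else 0
        lines.append(f"{name:<12} {'#' * bar_length} {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical output of the processing stage: orders per region.
    orders_by_region = {"north": 120, "south": 45, "east": 80}
    print(render_report("Orders by region", orders_by_region))
```

Sorting by value and scaling every bar against the largest one keeps the comparison between categories visible without requiring the reader to interpret the raw numbers.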