In a previous article, we looked at the different stages involved in a typical big data pipeline. In this article, we are going to take a deep dive into the different Google Cloud Platform (GCP) services that can be used to build a big data pipeline. We will also look at the equivalent Amazon Web Services (AWS) and Microsoft Azure offerings.
The first stage of the big data pipeline is data collection. As mentioned, this stage involves using a data lake as a repository for large amounts of raw data collected over time in different formats. Cloud Storage is the GCP service that can act as the data lake for your pipeline. Let’s take a look at what Cloud Storage is and some of its capabilities.
Cloud Storage is Google Cloud’s object-based storage offering, built to be both scalable and secure. Objects are pieces of data such as images, text, documents and videos. Cloud Storage stores objects in buckets, and each bucket is associated with a specific Google Cloud project. You can read and write objects to a bucket whenever you like. Objects and buckets are encrypted at rest, ensuring that your data is protected. You can also supply your own encryption keys to encrypt your Cloud Storage buckets and manage those keys using Google Cloud’s Key Management Service, or opt for an external key management service hosted on-premises or by a third-party vendor.
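To make the customer-supplied key option concrete, here is a minimal sketch of how such a key is presented to the Cloud Storage APIs: a 32-byte AES-256 key is sent base64-encoded in request headers, together with the base64-encoded SHA-256 hash of the raw key. The helper function below is an illustration written for this article, not part of any GCP SDK.

```python
# Sketch: build customer-supplied encryption key (CSEK) request headers
# for the Cloud Storage APIs. The key itself never leaves your control;
# Google stores only the hash to identify which key an object needs.
import base64
import hashlib
import os


def csek_headers(raw_key: bytes) -> dict:
    """Return the CSEK headers for a raw 32-byte AES-256 key."""
    assert len(raw_key) == 32, "AES-256 key must be exactly 32 bytes"
    return {
        "x-goog-encryption-algorithm": "AES256",
        "x-goog-encryption-key": base64.b64encode(raw_key).decode(),
        "x-goog-encryption-key-sha256":
            base64.b64encode(hashlib.sha256(raw_key).digest()).decode(),
    }


# Generate a random key locally; in practice this comes from your KMS.
headers = csek_headers(os.urandom(32))
```

These headers would accompany each read or write of an object encrypted with your key; losing the key means losing access to the object.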
A key feature of Cloud Storage is public and private accessibility. You can make your bucket and/or objects publicly accessible, for example when a website is hosted from a Cloud Storage bucket and people on the internet need to reach it. Otherwise, you can keep buckets or objects private, limiting access to specific users. Another key feature is object versioning: once enabled on a bucket, it allows you to access older versions of your objects when required. The third key feature is object lifecycle management, which automatically transitions your objects to different storage classes depending on how frequently you access them, resulting in an overall lower storage cost.
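Lifecycle management is configured declaratively. The sketch below builds a lifecycle configuration in the JSON shape accepted by the Cloud Storage JSON API and `gsutil lifecycle set`; the age thresholds and version count are illustrative choices for this article, not recommendations.

```python
# Sketch of a Cloud Storage object lifecycle configuration. Each rule
# pairs an action (change storage class, or delete) with a condition.
import json

lifecycle_config = {
    "rule": [
        # After 30 days, move objects down to the Nearline class.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        # After 90 days, move them down again to Coldline.
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        # With versioning enabled, delete a version once 5 newer ones exist.
        {"action": {"type": "Delete"},
         "condition": {"numNewerVersions": 5}},
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```

Saved to a file, this configuration could be applied with `gsutil lifecycle set lifecycle.json gs://your-bucket` (bucket name hypothetical).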
The different storage classes available in Cloud Storage are associated with different pricing tiers: Standard, Nearline, Coldline and Archive. The right class depends on how frequently you need to access your data and on the availability you require. Standard offers high availability and immediate, frequent access to your data; Nearline is an appropriate choice when you only need to access your data about once a month; Coldline when you access it about once a quarter; and Archive when you only access it about once a year.
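That rule of thumb can be captured in a few lines. The helper below is hypothetical, written for this article rather than taken from any GCP SDK, and simply maps expected reads per year onto the four class names.

```python
# Hypothetical helper: pick a Cloud Storage class from how often the
# data is expected to be read. Thresholds follow the rule of thumb
# above: monthly -> Nearline, quarterly -> Coldline, yearly -> Archive.

def choose_storage_class(accesses_per_year: float) -> str:
    if accesses_per_year > 12:   # more often than monthly
        return "STANDARD"
    if accesses_per_year > 4:    # roughly monthly
        return "NEARLINE"
    if accesses_per_year > 1:    # roughly quarterly
        return "COLDLINE"
    return "ARCHIVE"             # roughly yearly or less


print(choose_storage_class(365))  # daily access
print(choose_storage_class(1))    # yearly access
```

In practice you rarely call anything like this by hand; lifecycle rules apply the transitions for you as objects age.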
Now that we know more about Cloud Storage, let’s take a look at why it is a great choice for the data lake in our big data pipeline. First, it offers high performance and durability, allowing the ingestion of high volumes of data with 99.999999999% (eleven nines) durability. Second, it offers strong read-after-write consistency, making it easy to know when your data is available for processing. Third, it is cost effective: you can move between storage classes depending on how frequently you need to access your data, and when you do access it, it is retrieved with sub-second latency through an API. Fourth, it allows for flexible processing, integrating with powerful big data services including BigQuery, Dataproc, Dataflow and AI Platform. Finally, Cloud Storage lets you store and access all of your data in one place, keeping it in sync instead of in silos, and lets you manage which users have access to it.
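To see why moving between classes matters for cost, here is a back-of-the-envelope comparison. The per-GB prices below are rough, illustrative US multi-region figures that change over time; treat them as placeholders rather than quoted pricing, and note that colder classes add retrieval and operation fees not modelled here.

```python
# Illustrative monthly storage-cost comparison across Cloud Storage
# classes. Prices are placeholders, not authoritative GCP pricing.

PRICE_PER_GB_MONTH = {  # USD per GB per month, illustrative only
    "STANDARD": 0.020,
    "NEARLINE": 0.010,
    "COLDLINE": 0.004,
    "ARCHIVE":  0.0012,
}


def monthly_cost(gb: float, storage_class: str) -> float:
    """Storage cost only; retrieval/operation fees are extra."""
    return gb * PRICE_PER_GB_MONTH[storage_class]


for cls in PRICE_PER_GB_MONTH:
    print(f"{cls:>8}: ${monthly_cost(10_000, cls):>8,.2f}/month for 10 TB")
```

Even with placeholder numbers, the ordering is the point: letting lifecycle rules demote rarely-read data can cut its storage cost by an order of magnitude.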
It is evident that Cloud Storage is a great choice for a data lake, providing a centralized place to store all kinds of data, whether it arrives as a real-time stream or in batches, for use in your big data pipeline. AWS and Azure both offer similar services to GCP’s Cloud Storage: AWS has Simple Storage Service (S3) and Microsoft Azure has Azure Blob Storage, both providing object-based storage that can serve as a data lake. In the next article, we will take a look at GCP’s service offerings for data ingestion, the second stage of the big data pipeline.