How Trinus built a data lake on AWS for one of its major clients

Ironside Group, a consulting company, was looking to build a data lake for one of its customers so that the customer could make efficient use of data coming from disparate sources. A data lake on AWS S3 is a popular choice for organizations around the world, and not only popular but also efficient, because of the object storage and fast data-retrieval capabilities that S3 offers.

Data Lake

Data lakes are becoming tremendously popular with the increasing need for data for business development and analytics. Data lakes are not meant to replace traditional data warehouses; depending on its requirements, an organization may use both a data warehouse and a data lake.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Business Goal

The ultimate goal of building the data lake on S3 was the staging layer. Data in this layer was transformed, joined, and partitioned from the raw data layer. Different partitions were required for different customers for querying purposes, and customers could query the staging layer directly.

Data lakes, unlike data warehouses, are meant to store data in its raw format, without careful upfront design or the need to know in advance which questions the data might have to answer. In this use case, however, the data was divided into a raw layer and a staging layer, based on the partitions and a few transformations. The raw layer held the data as-is in CSV format, whereas the staging layer held data that had been joined, transformed, partitioned, and compressed to Parquet format. Analytics were carried out on the staging layer.
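
For illustration, the two layers might be laid out as follows (bucket, table, and account id names here are hypothetical placeholders, not the client's actual naming; the LOAD-prefixed file name follows the convention DMS uses for full-load output):

    s3://example-datalake-raw/orders/LOAD00000001.csv
    s3://example-datalake-staging/orders/account_id=x/part-00000.snappy.parquet
    s3://example-datalake-staging/orders/account_id=y/part-00000.snappy.parquet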

Figure 1 – Data Lake on S3: various forms of data making up a data lake

Data Sources

The data sources included MySQL on AWS, Azure SQL, and MongoDB. The customer wanted to store historical data from all of these sources and also continuously replicate new data into the data lake, which comprised the raw layer and the staging layer.

Various Services Used:

DMS:

Since the MySQL database was already on AWS, DMS was a perfect service for getting the data onto S3. AWS Database Migration Service (DMS) can migrate data both into and out of the cloud, and one of its major advantages is continuous replication. So near-real-time processing of the MySQL and MongoDB data was achievable through DMS.

A replication instance was created to connect to the source data store, read the source data, and format the data for consumption by the target data store; the replication instance also loads the data into the target. The instance configuration was chosen based on the volume of the data and the frequency of the changes to be captured through change data capture (CDC).
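
As a rough sketch of that setup with boto3 (all ARNs and identifiers below are hypothetical placeholders, not this project's actual resources), a full-load-plus-CDC task covers both the historical load and the ongoing replication:

    import json
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    # Hypothetical ARNs for a MySQL source endpoint, an S3 target
    # endpoint, and the replication instance described above.
    response = dms.create_replication_task(
        ReplicationTaskIdentifier="mysql-to-s3-raw-layer",
        SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",
        TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",
        ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INST",
        # Load the existing (historical) data first, then stream changes.
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }]
        }),
    )

    dms.start_replication_task(
        ReplicationTaskArn=response["ReplicationTask"]["ReplicationTaskArn"],
        StartReplicationTaskType="start-replication",
    )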

Once DMS had landed the data in the raw layer of S3, it was processed using Glue to join and partition the data based on customer account id, and then stored in the next layer, i.e. the staging layer of S3.

Glue:

The challenge was getting continuously replicated data from Azure SQL, as DMS could only perform a one-time load from Azure. So AWS Glue played an important role here. Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. You simply point AWS Glue to your data stored on AWS, and Glue discovers the data with a crawler and stores the associated metadata (e.g. table definitions and schemas) in the AWS Glue Data Catalog. Once cataloged, the data is immediately searchable, queryable, and available for ETL.
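
A minimal sketch of cataloging the raw layer with a crawler via boto3, assuming hypothetical bucket, role, and database names:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Hypothetical names for illustration only.
    glue.create_crawler(
        Name="raw-layer-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://example-datalake-raw/"}]},
    )
    glue.start_crawler(Name="raw-layer-crawler")
    # Once the crawler finishes, the tables it discovered are queryable
    # through the Glue Data Catalog (e.g. from Athena or Glue ETL jobs).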

The code to extract data from the data store, transform it, and then load it into the data target was written in PySpark. The joins and the partitioning logic were implemented in the PySpark code, along the lines of the sketch below.
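
A minimal sketch of that raw-to-staging pattern, assuming hypothetical bucket paths, table names, and an account_id join key (the actual job used the client's own schemas):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-staging").getOrCreate()

    # Read the as-is CSV data from the raw layer (hypothetical paths).
    orders = spark.read.csv("s3://example-datalake-raw/orders/", header=True)
    accounts = spark.read.csv("s3://example-datalake-raw/accounts/", header=True)

    # Join the source tables so each record carries its customer account id.
    staged = orders.join(accounts, on="account_id", how="inner")

    # Write to the staging layer as Parquet, partitioned by account id so
    # each customer's data lands under its own account_id=<value>/ prefix.
    (staged.write
        .mode("overwrite")
        .partitionBy("account_id")
        .parquet("s3://example-datalake-staging/orders/"))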

So, for Azure SQL, two jobs were built: one to pull the data from the source into the raw layer of S3, and another to move it to the staging layer, where the data was joined, transformed, and partitioned based on account id.

Near-real-time data through Glue was achievable by scheduling recurring ETL jobs at a fixed frequency. To automate the jobs, a workflow was created with triggers to maintain the dependencies between the jobs.
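
A rough sketch of such scheduling and dependency triggers with boto3 (the job names and the 15-minute schedule are hypothetical):

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Run the extraction job on a recurring schedule.
    glue.create_trigger(
        Name="azure-sql-extract-schedule",
        Type="SCHEDULED",
        Schedule="cron(0/15 * * * ? *)",
        Actions=[{"JobName": "azure-sql-to-raw"}],
        StartOnCreation=True,
    )

    # Start the raw-to-staging job only after the extraction job
    # succeeds, which maintains the dependency between the two jobs.
    glue.create_trigger(
        Name="raw-to-staging-dependency",
        Type="CONDITIONAL",
        Predicate={
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": "azure-sql-to-raw",
                "State": "SUCCEEDED",
            }]
        },
        Actions=[{"JobName": "raw-to-staging"}],
        StartOnCreation=True,
    )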

S3:

Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake.

The two layers of the data lake corresponded to two separate S3 buckets. The raw layer bucket held untransformed data, whereas the staging layer bucket held the joined, transformed, and partitioned data.

Access to the buckets was managed through IAM users. Access to the staging layer bucket was restricted by account id: a customer with account id x could only access the account_id=x prefix, and so on.
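
A sketch of what such a restriction could look like as an IAM inline policy attached via boto3 (the user, bucket, prefix, and account id value are all hypothetical):

    import json
    import boto3

    iam = boto3.client("iam")

    # Each customer user gets a policy scoped to its own
    # account_id=<value> prefix in the staging bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::example-datalake-staging",
                "Condition": {
                    "StringLike": {"s3:prefix": ["orders/account_id=x/*"]}
                },
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::example-datalake-staging/orders/account_id=x/*",
            },
        ],
    }

    iam.put_user_policy(
        UserName="customer-x",
        PolicyName="staging-account-x-only",
        PolicyDocument=json.dumps(policy),
    )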

Athena:

Athena was used to analyze the data in the staging layer of S3. Athena integrates with the AWS Glue Data Catalog, which provides a persistent metadata store for data in Amazon S3. This allowed us to create tables and query the data in either the raw or the staging layer of S3.
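
For example, a query against the staging layer could be submitted through boto3 (the database, table, columns, and results bucket are hypothetical):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Athena resolves the table definition from the Glue Data Catalog
    # and scans only the matching account_id partition of the Parquet data.
    athena.start_query_execution(
        QueryString=(
            "SELECT order_id, order_total "
            "FROM orders "
            "WHERE account_id = 'x'"
        ),
        QueryExecutionContext={"Database": "datalake_staging"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )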

Redshift:

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service. Redshift uses columnar storage for database tables, which optimizes analytic query performance because it drastically reduces the overall disk I/O requirements and the amount of data you need to load from disk. In this project, Redshift was used to build a star schema fed from the staging layer.
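
One way to feed such a schema, sketched here with the boto3 Redshift Data API and hypothetical cluster, table, and role names, is to COPY a staging partition's Parquet files into a fact table:

    import boto3

    rsd = boto3.client("redshift-data", region_name="us-east-1")

    # Loads one staging partition's Parquet files into a Redshift table
    # that feeds the star schema. All identifiers are placeholders.
    rsd.execute_statement(
        ClusterIdentifier="example-dw-cluster",
        Database="analytics",
        DbUser="etl_user",
        Sql=(
            "COPY fact_orders "
            "FROM 's3://example-datalake-staging/orders/account_id=x/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
            "FORMAT AS PARQUET;"
        ),
    )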

Implementing security

One of the most important aspects of any cloud project is security. AWS KMS was used to encrypt the S3 buckets. Another challenge was row-level security, for which IAM inline policies were used to restrict access to specific S3 partitions, Glue databases, and tables.
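
Default KMS encryption for a bucket can be set along these lines (the bucket name and key ARN are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # New objects written to the bucket are encrypted with the
    # customer-managed KMS key by default.
    s3.put_bucket_encryption(
        Bucket="example-datalake-staging",
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",
                }
            }]
        },
    )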

References
Figure 1 – https://aws.amazon.com/products/storage/data-lake-storage/