Creating an NRT data pipeline with AWS S3, SQS, Lambda & Redshift

Abhik Chakraborty
5 min read · Jun 18, 2023


Amazon Web Services (AWS) is a leading cloud computing platform that offers a wide range of services to help data engineers and scientists efficiently manage and process large amounts of data. With services such as S3, SQS, EMR, Redshift, Lambda, and more, data engineers can easily build and maintain scalable, cost-effective, and fault-tolerant data pipelines to collect, store, and process data in both batch and real time.

Photo by Campaign Creators on Unsplash

Today, we will be talking about how I built a near real-time (NRT) pipeline using these services while working at a MAANG company.

Problem Statement

During my tenure, I was tasked with replacing an existing AWS Glue job that struggled to handle the influx of small files. Each file contained around 10,000 records, but the sheer volume of data being delivered was quite high: approximately 100 files per minute were written to S3. This rapid data delivery posed a significant hurdle, as the Glue job frequently failed due to what we termed the "small file issue."

Compounding the problem, there was a critical requirement to make the file data available to the machine learning team within just 30 minutes of delivery to S3. The existing Glue-based pipeline attempted to address this by running the job every half hour and loading files in batches based on a timestamp. However, this approach proved costly, both in terms of resource utilization and time. The small file issue not only slowed the pipeline and caused frequent failures, but also necessitated adding more workers, further increasing costs.

Solution

Determined to find an optimal solution, I embarked on a journey of exploration and experimentation. It became evident that the file processing time could be drastically reduced by leveraging the power of libraries like AWS Wrangler. With this realization, I devised a novel approach.

My solution involved creating a set of Lambda functions, each specifically designed to process a particular type of data file. These Lambda functions were integrated with Simple Queue Service (SQS) queues, which served as event triggers whenever a new file was uploaded to S3. This architecture enabled files to be processed within minutes of arrival. Let's see how one can implement this.

Simplified Diagram of the implementation

Implementation

Prerequisites

For this implementation, we will mainly require the following:

  • S3: This will be used to stage the data files. The source systems have to write the data files to an S3 destination
  • SQS: This will act as a buffer that stores the S3 file events. Every time a file is written, an event is pushed to this queue (a wiring sketch follows this list)
  • Lambda: A Lambda function will be hooked up to SQS and triggered automatically for every event to process the corresponding file
  • Redshift: This is the destination where the contents of each file will be written
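
To make the wiring concrete, below is a minimal boto3 sketch of how one data type's S3 prefix could be connected to its own SQS queue, with a dead letter queue attached. The bucket name, prefix, and queue names are placeholders rather than the ones from the actual pipeline, and the queue access policy that allows S3 to send messages is omitted for brevity.

```python
# Sketch: wiring one data type's S3 prefix to its own SQS queue plus a DLQ.
# "my-data-bucket", "orders-events", and "orders/" are placeholder names.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def queue_arn(queue_url: str) -> str:
    attrs = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["QueueArn"])
    return attrs["Attributes"]["QueueArn"]

# Create the dead letter queue first so the main queue can reference it
dlq_url = sqs.create_queue(QueueName="orders-events-dlq")["QueueUrl"]

# Main queue with a redrive policy: after 3 failed receives, messages go to the DLQ
main_url = sqs.create_queue(
    QueueName="orders-events",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": queue_arn(dlq_url), "maxReceiveCount": "3"}
        )
    },
)["QueueUrl"]

# Route ObjectCreated events for the "orders/" prefix to the queue.
# (The queue's access policy must also allow s3.amazonaws.com to SendMessage;
# that step is not shown here.)
s3.put_bucket_notification_configuration(
    Bucket="my-data-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn(main_url),
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "orders/"}]}
                },
            }
        ]
    },
)
```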

Steps

  1. The files dropped into the S3 bucket were partitioned by date. We discussed this with the sourcing team, and they agreed to add a new partition based on the type of data, since every type of data file had to go to a different Redshift table.
  2. For each type, an SQS queue was created, and S3 events were routed to those queues, with each queue receiving events for a single file type via a prefix filter (see the wiring sketch above).
  3. Separate Lambda functions were created to process each file type. The code was exactly the same: it read the event coming from SQS and, based on the event type, chose the Redshift stage table to which the data had to be written. It would then write the data into that table within a few seconds and stop; the average processing time for a file was about 25 seconds. The code was packaged as a Docker image hosted on ECR, and the same image was linked to all the Lambda functions (sketches of the handler and the deployment follow this list).
  4. All the SQS queues had dead letter queues set up, so no file was ever missed, even when processing failed.
  5. All the additional processing was done on Redshift, where a SQL job ran every 15 minutes to read the stage table, perform the required transformations, compare the results against the final table, and then write the data into it. This took between one and two minutes (a sketch of this job also follows the list).
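
To give a feel for step 3, here is a rough sketch of what the shared Lambda handler could look like. The file format (CSV), the prefix-to-table mapping, and the Glue connection name are assumptions made for illustration; the original code is not reproduced here.

```python
# Sketch of the shared Lambda handler. The CSV format, the prefix-to-table
# mapping, and the Glue connection name are illustrative assumptions.
import json
from urllib.parse import unquote_plus

import awswrangler as wr

# Hypothetical mapping from S3 prefix (data type) to Redshift stage table
STAGE_TABLES = {
    "orders": "stg_orders",
    "clicks": "stg_clicks",
}

def handler(event, context):
    # One connection per invocation; the Glue connection name is assumed
    con = wr.redshift.connect(connection="my-redshift-glue-connection")
    try:
        for sqs_record in event["Records"]:
            # Each SQS message body is an S3 event notification in JSON
            s3_event = json.loads(sqs_record["body"])
            for s3_record in s3_event.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = unquote_plus(s3_record["s3"]["object"]["key"])
                data_type = key.split("/")[0]  # e.g. "orders/2023/06/18/part-000.csv"

                # Read the small file and append it to the matching stage table
                df = wr.s3.read_csv(f"s3://{bucket}/{key}")
                wr.redshift.to_sql(
                    df=df,
                    con=con,
                    schema="staging",
                    table=STAGE_TABLES[data_type],
                    mode="append",
                )
    finally:
        con.close()
```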
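
Linking the same container image to every function and attaching each SQS queue as a trigger can also be scripted. The sketch below uses boto3 with placeholder ARNs, names, and image URI.

```python
# Sketch: creating one Lambda function per data type from the same ECR image
# and attaching its SQS queue as the trigger. ARNs, names, and the image URI
# are placeholders.
import boto3

lambda_client = boto3.client("lambda")

IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/nrt-file-loader:latest"  # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/nrt-lambda-role"                        # placeholder

DATA_TYPES = {
    "orders": "arn:aws:sqs:us-east-1:123456789012:orders-events",
    "clicks": "arn:aws:sqs:us-east-1:123456789012:clicks-events",
}

for data_type, queue_arn in DATA_TYPES.items():
    function_name = f"nrt-loader-{data_type}"

    # Same container image for every function; behaviour differs only by event
    lambda_client.create_function(
        FunctionName=function_name,
        PackageType="Image",
        Code={"ImageUri": IMAGE_URI},
        Role=ROLE_ARN,
        Timeout=120,
        MemorySize=1024,
    )

    # SQS trigger: Lambda polls the queue and invokes the function with messages
    lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=1,
    )
```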
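
Finally, the 15-minute Redshift job was not described in detail above, so the following is only a hypothetical example of what a stage-to-final upsert could look like. In practice this could run as a Redshift scheduled query or stored procedure; here it is wrapped in a small Python script using awswrangler's Redshift connection, and all table and column names are invented.

```python
# Rough sketch of the periodic stage-to-final merge. Table and column names
# (staging.stg_orders, analytics.orders, order_id, ...) are invented examples.
import awswrangler as wr

STATEMENTS = [
    # Remove final-table rows that are about to be replaced by newer stage rows
    """
    DELETE FROM analytics.orders
    USING staging.stg_orders s
    WHERE analytics.orders.order_id = s.order_id
    """,
    # Insert the (transformed) stage rows into the final table
    """
    INSERT INTO analytics.orders (order_id, customer_id, amount, updated_at)
    SELECT order_id, customer_id, amount, updated_at
    FROM staging.stg_orders
    """,
    # Clear the stage table for the next 15-minute cycle
    "DELETE FROM staging.stg_orders",
]

def run_merge() -> None:
    con = wr.redshift.connect(connection="my-redshift-glue-connection")  # assumed connection name
    try:
        cursor = con.cursor()
        for statement in STATEMENTS:
            cursor.execute(statement)
        con.commit()  # all three statements succeed or fail together
    finally:
        con.close()

if __name__ == "__main__":
    run_merge()
```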

Advantages

  1. Faster Processing Time: The solution utilized AWS Lambda and SQS, significantly reducing processing time and enabling near real-time availability of data. Files were processed within seconds of triggering, allowing for faster insights and decision-making based on the latest data.
  2. Cost Efficiency: The new solution operated at approximately 1/10th the cost of the AWS Glue job. By leveraging Lambda functions for processing, it optimized expenses and minimized the need for additional infrastructure, resulting in substantial cost savings.
  3. Scalability and Flexibility: The architecture built around Lambda and SQS offered excellent scalability and flexibility. Each Lambda function could independently process files, enabling efficient parallelization and accommodating future growth or fluctuations in data workload.
  4. Improved Reliability: The implemented solution utilized the guaranteed delivery mechanism of SQS, minimizing failures and ensuring reliable processing of each file. This reduced the risk of data loss or processing errors, enhancing the overall integrity of the pipeline.
  5. Simplified Development and Maintenance: Developing Lambda functions for specific file types simplified the development and maintenance process. It provided a clear and modular design, allowing engineers to focus on optimization and value delivery, rather than managing complex infrastructure.
  6. Near Real-time Data Availability: With data reaching the target database within 15 to 20 minutes, the implemented solution met the requirement of near real-time data availability. This facilitated faster analysis, model training, and decision-making based on up-to-date information.
  7. Resource Optimization: By leveraging the lightweight nature of Lambda functions, the solution optimized resource utilization. It eliminated the need for a sophisticated setup like AWS Glue for processing files smaller than 1 GB, resulting in cost savings and more efficient resource allocation.
  8. Enhanced Customization: The modular design of the solution enabled engineers to customize and fine-tune each Lambda function based on specific data file requirements. This customization allowed for optimizing processing logic and tailoring it to unique use cases, enhancing the overall effectiveness and efficiency of the solution.

I hope that reading about this implementation has been helpful. By sharing my experience and the advantages of this solution, I aim to shed light on my thought process while designing it, and I would welcome detailed feedback and alternative solutions so that I can keep improving. Cheers!
