AWS — Use S3, Glue, Athena, and PostgreSQL for ETL Pipeline

A Practical Guide to AWS Glue

Ihor Lukianov
8 min read · Aug 29, 2023

Introduction

After data extraction, the S3 bucket contains raw data files, which can come in various formats such as CSV, JSON, and Parquet. To use this data for analytical processes, it needs to be transformed and loaded into a database or data warehouse. In this article, we will explore the ETL process using AWS tools like Glue and load the data into PostgreSQL.

Source: https://mfarache.github.io/mfarache/Query-S3-Athena-AWS-Glue/

The data used for this example comes from a separate project involving the Reddit API. The data was extracted from the API service and stored in Cloud Storage before being loaded into BigQuery. To simplify things, some of the files from that project were transferred to the S3 bucket.

What is AWS Glue

AWS Glue is a fully managed ETL service built around a central metadata repository, the Glue Data Catalog. It also features a serverless Spark ETL engine, which makes it very easy to use, and a scheduler flexible enough to cover most ETL needs.

This service is designed to eliminate entry barriers, making it easier for data engineers to work with ETL pipelines. Together with other AWS services like S3, Redshift, and more, you can kickstart your pipeline in almost no time.

To work with Glue, you need to understand some concepts which we will cover in this article.

  • S3 — which is the most commonly used data source for AWS Glue;
  • Data Catalog — a metadata storage;
  • Databases, Schemas, Tables, Crawlers, and Connectors in the Data Catalog — which simplify data extraction;
  • Athena — a simple SQL query tool for working with S3 buckets;
  • Glue Jobs and ETL — which make up the core of the Glue workflow.

Create S3 bucket

As previously mentioned, I will use the data stored in parquet files from my Reddit API project. First, I am creating a new S3 bucket named “reddit-api-glue”. Within this bucket, there will be three folders: “data”, “temp-dir”, and “scripts”. The “data” folder will contain the actual data, the “temp-dir” folder will serve as a temporary directory for Glue files, and the “scripts” folder is designated for the Glue scripts we will provide.

I have uploaded all eight files to the “data/comments-database” folder. Once the upload is complete, you can view the files within the designated folder.

Files successfully uploaded
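For reference, the same bucket setup can be scripted with boto3. This is a minimal sketch, assuming default AWS credentials and a bucket in us-east-1; the local file names are placeholders:

import boto3

s3 = boto3.client("s3")

# Create the bucket (buckets outside us-east-1 also need a LocationConstraint)
s3.create_bucket(Bucket="reddit-api-glue")

# The "folders" are just key prefixes, so uploading objects under them is enough
local_files = ["comments_0.parquet", "comments_1.parquet"]  # placeholder file names
for name in local_files:
    s3.upload_file(name, "reddit-api-glue", f"data/comments-database/{name}")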

Data Catalog

AWS Glue is based on the concept of the Data Catalog, which is a persistent metadata store and a managed service. Its purpose is to store, annotate, and share metadata for data querying and transformation. It is important to note that only one Data Catalog is allowed per AWS region.

To access the Data Catalog, go to the AWS console and open AWS Glue. In the left panel, you will see various options, but for the purpose of this article, we will focus on Databases.

At the AWS Glue panel

At this point, I need to create a new database — a logical container in the Data Catalog that will hold the table metadata for our newly added files.

Create new database
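If you prefer to script this step, the database can also be registered with boto3; a minimal sketch, where the description is my own assumption:

import boto3

glue = boto3.client("glue")

# Register a database in the Glue Data Catalog
glue.create_database(
    DatabaseInput={
        "Name": "comments_database",
        "Description": "Reddit comments stored as parquet files in S3",
    }
)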

Next, we create a table in the database through the Data Catalog. A table represents the schema of the data, and the simplest way to create one is with a Crawler, which infers the schema directly from the files. We only need to point it at a data source — the path to the S3 folder with our data. A schedule can be added later, but it is not required for this guide since we run the crawler on demand. I suggest learning more about crawlers, as they allow creating more complex schemas.

Read schema from S3 bucket with Crawler
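The same crawler can be defined and started programmatically. This is a sketch only, assuming an existing IAM role with Glue and S3 permissions; the crawler name and role ARN are placeholders:

import boto3

glue = boto3.client("glue")

# Define a crawler that infers the schema from the files in the S3 folder
glue.create_crawler(
    Name="comments-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="comments_database",
    Targets={"S3Targets": [{"Path": "s3://reddit-api-glue/data/comments-database/"}]},
)

# Run it on demand; the resulting table appears in the Data Catalog
glue.start_crawler(Name="comments-crawler")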

Athena

One of the useful AWS services to know about is Athena, which enables us to effortlessly query files from our S3 bucket using simple SQL statements. To get started, we just need to create a new folder called “athena_results” in our original reddit-api-glue bucket. This folder will be utilized to store the results of our queries.

After creating the folder, we can add its query result location in Athena. To do this, we need to go to Query Editor -> Settings and manage Query result and encryption settings. Once done, we can see that the location has been successfully added.

Location for query results added

Now we can use the Query editor to test how it works. We can preview the table with a simple SQL query.

SELECT * FROM "comments_database"."comments_database" limit 10;
The results of our query

We can now see a portion of our table. With the help of Athena, we have verified that the previous steps were completed accurately and can proceed further. Athena can be highly beneficial in various scenarios, and I suggest considering it when dealing with S3 buckets.
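The same query can also be issued programmatically; a minimal boto3 sketch, assuming the athena_results folder created above and using simplified polling:

import time
import boto3

athena = boto3.client("athena")

# Submit the query, sending results to the athena_results folder
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "comments_database"."comments_database" limit 10;',
    ResultConfiguration={"OutputLocation": "s3://reddit-api-glue/athena_results/"},
)
query_id = execution["QueryExecutionId"]

# Naive polling until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Fetched {len(rows)} rows (the first row is the header)")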

Glue Connections

Within the Data Catalog, there is a feature called Connections that offers a multitude of pre-configured connectors to databases and other sources. This feature is easy to use as it only requires your working database credentials to get started.

I won’t delve into too much detail as it’s quite straightforward. I personally make all my connections within the Job feature, eliminating the need to create a separate connection. However, if necessary, you can easily add your database.
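For completeness, registering a JDBC connection in the Data Catalog with boto3 could look roughly like this; the connection name, host, and credentials are placeholders:

import boto3

glue = boto3.client("glue")

# Register a reusable JDBC connection (all values are placeholders)
glue.create_connection(
    ConnectionInput={
        "Name": "postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-host:5432/my_db",
            "USERNAME": "my_user",
            "PASSWORD": "my_password",
        },
    }
)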

Glue Jobs

We have reached the ETL portion of Glue, which is the primary reason we are utilizing this service. Glue Jobs consist of a transformation script, data source, and data targets, and are often initiated by triggers that can be scheduled or triggered by events.

Currently, there are plenty of options for using Glue through AWS Glue Studio. In particular, low-code visual tools can be used to create simple ETL pipelines. This may be the easiest option for transforming data from one format, such as CSV, to another, such as Parquet, if basic functionality is all that is needed.

AWS Glue Studio — Visual ETL

While the low-code format may be convenient, it does have its limitations. You may find yourself restricted in terms of available data sources and options in general. For those looking to gain a deeper understanding of data transformation, it would be advisable to utilize Python scripts or Notebooks for a more robust experience. Below are all the available options currently offered in Glue Studio.

Options in AWS Glue Studio

I will use a Jupyter Notebook, as it is the easiest way to understand how Glue actually works. These notebooks are fully managed by AWS and provide broad functionality, so you can set up your Spark cluster in a couple of minutes and start transformations right away.

Start session

To begin, let’s set up our Spark cluster and initiate the session. The boilerplate code is generated for us, so we only need to adapt it to our data. In this instance, I’ll be using only 2 workers. The crucial part is in the final lines: once the session is running, we read data from our Data Catalog using the database and table we specified earlier. A single line of code creates a frame ready for transformation.

# Session configuration magics: idle timeout, Glue version, worker type and count
%idle_timeout 2880
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Spark and Glue contexts for the session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Read the table registered by the crawler into a dynamic frame
dyf = glueContext.create_dynamic_frame.from_catalog(database='comments_database', table_name='comments_database')
dyf.printSchema()

Transform

The method create_dynamic_frame does not return a traditional PySpark data frame. Instead, it generates a dynamic frame, which, according to Amazon, is the most efficient way to use Glue. The documentation describes many familiar operations, such as selecting fields, joining, and more. Since the changes required depend on your specific use case, I will only select certain columns and rename them for this example.

dyf_upload = (
    dyf.select_fields(['comment_id', 'commentnum_upvotes'])
       .rename_field('comment_id', 'id')
       .rename_field('commentnum_upvotes', 'comments_count')
)
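If a transformation you need is not available on a dynamic frame, you can convert it to a regular Spark DataFrame and back; a small sketch, where the filter condition is only an illustration:

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame for operations the DynamicFrame API lacks
df = dyf_upload.toDF()
df = df.filter(df["comments_count"] > 0)  # illustrative filter

# Convert back to a DynamicFrame before writing with Glue
dyf_upload = DynamicFrame.fromDF(df, glueContext, "dyf_upload")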

Upload to Postgres

To start, we must establish a connection to the database we will write to. In this scenario, I will be using an external PostgreSQL database. AWS Glue offers the capability to make ETL connections to various sources; additional details can be found in the documentation.

Since we are utilizing Postgres, our connection type will be JDBC, which can also be used for MS SQL Server, MySQL, Oracle Database, MongoDB, and Redshift. The process is relatively straightforward as you simply need to supply your credentials and target table. Moreover, for specific database versions, a JDBC driver may be necessary. In such a case, you can upload it to your S3 bucket and provide its location using the customJdbcDriverS3Path parameter.

connection_postgres = {
    "url": CONN_URL,
    "dbtable": "test",
    "user": USERNAME,
    "password": PASSWORD,
}

Now our connection is prepared. Once it’s done, we can simply use our Glue context to write our dynamic frame to the Postgres table. After all the previous steps, you can view your new data in the table.

glueContext.write_from_options(
    frame_or_dfc=dyf_upload,
    connection_type="postgresql",
    connection_options=connection_postgres,
)
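If your database version requires the custom JDBC driver mentioned earlier, the same write can point to a driver jar uploaded to the bucket; a hedged sketch, with the jar path as a placeholder:

# Connection options with a custom JDBC driver stored in S3 (jar path is a placeholder)
connection_postgres_custom = {
    "url": CONN_URL,
    "dbtable": "test",
    "user": USERNAME,
    "password": PASSWORD,
    "customJdbcDriverS3Path": "s3://reddit-api-glue/scripts/postgresql-42.6.0.jar",
    "customJdbcDriverClassName": "org.postgresql.Driver",
}

glueContext.write_from_options(
    frame_or_dfc=dyf_upload,
    connection_type="postgresql",
    connection_options=connection_postgres_custom,
)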

Conclusions

In this article, we have examined the features of AWS Glue, which is a highly effective ETL tool. This serverless solution allows users to create and deploy pipelines rapidly. It smoothly integrates with the AWS ecosystem, including S3 and data catalogs, as well as external databases.

While this article gives a broad understanding of AWS Glue, I recommend that data engineers explore its workflows and potential further. For example, it offers an excellent operator for Airflow that makes orchestration easier.

I can be found on LinkedIn and I am looking forward to connecting with you. Let’s engage in discussions about the intricate world of data science and data engineering.
