By Jasper Callaerts – Data Engineer
Hi, my name is Jasper Callaerts and in this blog, I will be telling you all you need to know to start building your ETL pipeline in AWS. Thanks to Vincent’s blog, you already know a lot about the different components that come together in a serverless platform architecture, and more recently, Olivier’s blog brought some insight into the first step of such an architecture: the data ingestion. This blog will tackle the next steps that are needed to extract, transform & load your data for analytics, reporting & machine learning.
Let’s take a step back…
Before we take a deep dive into ETL, let’s first take a step back and look at where this piece fits into the whole puzzle. As I mentioned above, we already saw an example architecture of a full serverless platform in a previous blog. In the image below, you can see that there are a lot of components/services that fit together perfectly, going from ingesting and storing the data in the platform, to transforming and loading it, to eventually querying, analyzing and visualizing the output to gain better insights.
As you can see above, the ETL process serves as a bridge between the multiple raw data inputs and the clean, structured and interpretable output. As you can derive from its name, it is a process that consists of 3 steps: Extract the data from the source, Transform the data in order to structure and clean it & Load the data into the right place to make it easily accessible for analytics & reporting.
I like to make the following comparison. You probably remember being (or still are) in school, having to prepare for an exam and having a big load of study material spread out over textbooks, presentations and exercises. You partied the whole year and didn’t take notes during classes, and now you have to study all this stuff, which contains a lot of unnecessary information, in just one day… I see the ETL process as the smart, orderly kid in your class, who went to every lesson and made notes and summaries of the material. They take the different textbooks, presentations and exercises as input, extract all the necessary parts of the course from this (and ignore what is not needed for the exam). They add the extra comments from the professor (that you don’t have) and they structure the material into lists, bullet points and tables. With a bit of luck, you can get a copy of that kid’s summary. 😉
That is exactly what your ETL process should do. Since this is such a key process in any data platform, Amazon Web Services has introduced its own fully managed service that does just this. This service is called AWS Glue.
What is AWS Glue?
As mentioned above, AWS Glue is a fully managed, serverless environment where you can extract, transform, and load (ETL) your data. AWS Glue makes it cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. It works well with structured and semi-structured data and has an intuitive console to discover and transform the data, using Apache Spark with Python or Scala.
AWS Glue calls API operations to transform your data, create runtime logs, store your job logic, and create notifications to help you monitor your job runs. The AWS Glue console connects these services into a managed application, so you can focus on creating and monitoring your ETL work. When resources are required, AWS Glue uses an instance from its warm pool of instances to run your workload, which reduces startup time.
AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. For each Glue ETL Job, a new Spark environment is created, which is protected by an IAM role, a VPC, a subnet and a security group.
One of the major abstractions in AWS Glue is the DynamicFrame, which is similar to the DataFrame construct found in SparkSQL and Pandas. However, these DataFrames are limited because they require a schema to be specified before any data is loaded. This doesn’t address the realities of messy data. Because of this, AWS Glue introduces the DynamicFrame. The main difference is that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type. You can easily convert DynamicFrames to and from DataFrames after you resolve any schema inconsistencies.
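To make this a bit more concrete, here is a minimal sketch (the database, table and column names are hypothetical) of how a DynamicFrame is typically created from the Data Catalog, how a choice type is resolved, and how you can hop between a DynamicFrame and a regular Spark DataFrame:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Build a DynamicFrame straight from a table in the Glue Data Catalog
# (the database and table names below are placeholders)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# A column that contains both numbers and strings is encoded as a choice type;
# resolve the inconsistency explicitly by casting everything to long
dyf = dyf.resolveChoice(specs=[("amount", "cast:long")])

# Convert to a Spark DataFrame for SparkSQL-style operations...
df = dyf.toDF().filter("amount > 0")

# ...and back to a DynamicFrame before handing it to the Glue writers
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")
```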
AWS Glue mainly consists of 3 big parts, with multiple components:
Glue Data Catalog – this is a centralized repository that stores metadata about the structure and location of your data. It is possible to create Glue Databases with Glue Tables by using Glue Crawlers.
Glue Data Integration & ETL – here it is possible to generate Scala or Python code for the Glue Jobs, which can be orchestrated by using Glue Workflows and Glue Triggers.
Glue DataBrew – this is a visual data preparation tool that data analysts and data scientists can use to clean and normalize data. You can choose from more than 250 prebuilt transformations to automate data preparation tasks, all without the need to write any code.
In the following section, we will discuss all these components in more detail, together with some more technical functionalities.
A close-up of the different components
In this section, we will take a closer look at the different components that are present in the Glue Ecosystem. All these tools have their own specific task and can be used inside one ETL pipeline, as we will see in the next blog, where we will build our own ETL pipeline using AWS Glue.
Glue Data Catalog
Glue Table – this is the metadata definition that represents your data. It is important to know that this table does not contain the data itself, but it defines the schema of the data, the partitions and the data types.
Glue Database – this is simply a group of associated Glue Tables that are organized together.
Glue Data Source/Target – or Glue Data Store. This is the location where you store the data that is used as input/output for your ETL job. There are multiple possible Data Sources/Targets, like Amazon S3, JDBC databases supported through Amazon RDS, DynamoDB, MongoDB, etc.
Glue Crawler – this is a tool that connects to a Data Source and crawls through the data available in that source. It automatically determines the schema and structure of your data and creates a Glue Table in the Glue Database of your choice. This is extremely handy, since you don’t have to manually define the structure of the data and its types. The crawler does it for you.
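As a small illustration of how little is needed to set this up, the sketch below uses boto3 to register and start a crawler; the bucket, IAM role and database names are placeholders you would replace with your own:

```python
import boto3

glue = boto3.client("glue")

# Register a crawler that scans an S3 prefix and writes the inferred schema
# into a Glue Database (all names and the IAM role below are placeholders)
glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/orders/"}]},
    TablePrefix="raw_",
)

# Kick off a crawl; once it finishes, a raw_orders Glue Table shows up in sales_db
glue.start_crawler(Name="raw-orders-crawler")
```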
Glue Classifier – this is used by the crawler to determine the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, PARQUET, etc.
Glue Connection – this contains the properties that are required to connect to your Data Source. This is only needed when you connect to a database outside your AWS environment (e.g. JDBC, MongoDB). When your Data Source is within AWS (e.g. S3, DynamoDB), this connection isn’t necessary.
Glue Data Integration & ETL
Glue Script – this is the code that you develop to extract data from the Data Source(s), transform it and load it into the Data Target(s). This script can be generated via the Glue console, which uses Spark. You can choose PySpark or Scala as your language. It is also possible to write your own script and run it in the Glue Job.
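To give you a feel for what such a script looks like, here is a minimal PySpark sketch of the three ETL steps; the database, table, column and bucket names are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read a table that a crawler registered in the Glue Data Catalog
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="orders",  # needed for job bookmarks (see below)
)

# Transform: keep only the columns we need and cast them to the right types
cleaned = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
        ("order_date", "string", "order_date", "timestamp"),
    ],
)

# Load: write the result to S3 as Parquet, ready for Athena or Redshift
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```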
Glue Job – this is actually the business logic that is required to perform the ETL work. It runs the Glue Script and connects to the Data Source(s) and Data Target(s). Internally it uses Spark to execute the transformations and EMR/EC2 to execute these applications on a cluster. Glue Jobs offer a lot of extra features/functionalities that you can use; a small configuration sketch follows the list below.
- Bookmarks: this feature allows you to iteratively process incremental data, without having to reprocess the whole dataset each time you run the Glue Job. The bookmarks go through your Data Source, check which files have been added or changed since the last run and only feed those files to the Glue Script.
- Autoscaling: this feature allows you to optimize your run costs. You select a maximum number of workers/DPUs that you want to use for the Job; the Job then starts with a low number of workers and automatically scales up as needed, making a trade-off between runtime and cost.
- Continuous Logging: this feature allows you to choose the level of logging that you want to receive from your Glue Job.
- Libraries: it is also possible to add existing or custom libraries to the Glue Job.
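As a sketch of how these features come together, the snippet below creates a job with boto3 and switches on bookmarks, autoscaling, continuous logging and an extra Python library through job parameters; the job name, IAM role, script location and library are assumptions for illustration:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts-bucket/orders_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,  # the maximum that autoscaling is allowed to reach
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",    # Bookmarks
        "--enable-auto-scaling": "true",                   # Autoscaling
        "--enable-continuous-cloudwatch-log": "true",      # Continuous Logging
        "--additional-python-modules": "openpyxl==3.1.2",  # Libraries
    },
)
```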
Glue Trigger – this initiates an ETL Job. These triggers can be defined based on a scheduled time or an event.
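A scheduled trigger, for instance, can be defined in a single call; the cron expression and job name below are just examples:

```python
import boto3

glue = boto3.client("glue")

# Run the (hypothetical) orders-etl-job every night at 02:00 UTC
glue.create_trigger(
    Name="nightly-orders-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "orders-etl-job"}],
    StartOnCreation=True,
)
```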
Glue Visual Editor – this is a (new) graphical interface that makes it easy to create, run and monitor Glue ETL Jobs. You can visually compose data transformation workflows and seamlessly run them on AWS Glue. You can inspect the schema and data results in each step of the job.
Glue Development Endpoint – this is an environment that you can set up and use to develop and test your Glue ETL Scripts. The downside is that these endpoints are quite expensive. Another option is to develop and test your code locally and use the Glue API/libraries to run your tests on your own machine.
Glue Notebook – this is a web-based environment that you can use to run your PySpark statements.
Glue Workflow – this tool helps you to orchestrate your whole Glue ETL pipeline and literally glue all the pieces together. It contains Jobs, Crawlers and Triggers. It is an easy-to-use tool that allows you to visualize complex ETL flows.
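To make the orchestration idea concrete, here is a hedged boto3 sketch that wires the crawler and job from the earlier examples into one workflow; all names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# A workflow that ties the crawler and the job together
glue.create_workflow(Name="orders-pipeline")

# Entry point: an on-demand trigger that starts the crawler
glue.create_trigger(
    Name="start-orders-pipeline",
    WorkflowName="orders-pipeline",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "raw-orders-crawler"}],
)

# Once the crawler succeeds, run the ETL job
glue.create_trigger(
    Name="crawl-then-transform",
    WorkflowName="orders-pipeline",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "raw-orders-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "orders-etl-job"}],
)

# The whole pipeline can then be started with a single call
glue.start_workflow_run(Name="orders-pipeline")
```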
Why & when should you (not) use AWS Glue?
There are a lot of benefits to using AWS Glue. Because it is serverless, it is very easy to set up, scalable and cost-effective. It is integrated with a very wide range of AWS services, so retrieving input data and storing processed data is very easy and possible in many different places (S3, Redshift, RDS, …).
As we already learned in one of the previous blogs, Glue is specifically designed for processing large amounts of data and performing complex transformations. Because it is based on Apache Spark, it is a very powerful tool for this purpose.
Though it is clearly a very good choice to use AWS Glue as your ETL service, it may not always be the best choice (most of the time it is 😉). For example, when you want to perform transformations or enrichment on small batches of data, setting up a Glue Job can be overkill and will result in unnecessarily high costs. For those cases you can use AWS Lambda. This is a very easy-to-use serverless service, supporting Python, Java, Node.js, .NET, Go and Ruby, which is ideal for very small ETL jobs or functions (e.g. copying and repartitioning data from one bucket to another). Another option is to step away from the serverless approach and set up an EMR cluster, on which you can run your own custom Spark code.
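For the copy example mentioned above, a Lambda function really can be that small; the sketch below (the target bucket and key prefix are made up) copies every object announced in an S3 event notification to a second bucket:

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

TARGET_BUCKET = "my-curated-bucket"  # hypothetical target bucket

def handler(event, context):
    """Copy each object from an S3 event notification into a second bucket."""
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=f"incoming/{key}",  # made-up prefix in the target bucket
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```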
But, as the title of this blog states: when you ain’t got a clue, use Glue!
So… What’s next?
Now that you have mastered the theory of ETL with AWS Glue, it is time to take the next step! As I already briefly mentioned, in the next blog we will build our own Glue ETL pipeline, based on a real-life case that I implemented myself. Of course, this will be a simplified version of the case, but it will give you the opportunity to get a more practical feel for how to use AWS Glue when creating an ETL pipeline, and how Glue interacts with other AWS services like S3, Lambda and Athena.
If you’re excited about our content, make sure to follow the InfoFarm company page on LinkedIn and stay informed about the next blog in this series. Interested in what a data platform would look like for your organization? Book a meeting with one of our data architects and we’ll tell you all about it!
Want to start building your own data platform straight away? Take a look at the InfoFarm One Day Data Platform, a reference architecture in both AWS and Azure. We get you going with a fully operational data platform in only one day! More info on our website.