AWS Glue to ingest a REST API into a Relational Database

Intro

AWS Glue provides a handy serverless platform for moving data around at scale, usually for machine learning or other kinds of data analytics. Your data doesn't have to be "Big Data" to qualify - it can be any size, really!



Key features of AWS Glue include:

  • Spark ETL Jobs
  • Python Shell Jobs
  • Glue Catalog
  • Glue Workflows
  • Interactive Data Visualisation

In this article, I'll be describing an architecture for ingesting an API into a relational store using a combination of Spark ETL and Python Shell jobs, orchestrated by a Glue Workflow.

Batch Ingestion for Analytics

In an ideal world, all data sets would be available immediately and streamed in for real-time consumption. The reality is that we usually have to settle for periodic retrieval, as real-time feeds can be costly to subscribe to from external data providers, or costly to build and maintain internal infrastructure for.

Periodic data retrieval can either come in the form of polling for new records in a database, or via API. This article describes how this can be done when the source is a REST API. Typical examples might include ingesting external public stock market data via REST API at nightly intervals, or collecting a daily batch of transactions from an orders reporting REST API.
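For a daily batch, the retrieval window is usually computed at run time. As a minimal sketch, here is one way to build the dateFrom/dateTo query parameters for a daily pull (the helper name is my own; football-data.org accepts these filters on its matches endpoint):

```python
import datetime


def daily_window(today=None):
    """Build dateFrom/dateTo query parameters covering yesterday's matches.

    Hypothetical helper - the parameter names match the filters shown in
    the API response later in this article.
    """
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    return {"dateFrom": yesterday.isoformat(), "dateTo": today.isoformat()}
```

For example, `daily_window(datetime.date(2022, 10, 27))` returns `{"dateFrom": "2022-10-26", "dateTo": "2022-10-27"}`, matching the filters block in the sample response further down.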

Solution at a Glance





The diagram above outlines the approach at a high level.

There are three main components to the design:

1) Glue Python Shell Job
2) Glue Spark ETL Job
3) Glue Workflow

The Python Shell job is scheduled to run daily to make a REST API call and save the result in an S3 bucket as a JSON file. The Spark ETL job takes the JSON file from S3, transforms it, and loads it into a relational database. Finally, the Glue Workflow is how it all "hangs together": it ensures that the shell job runs at a certain time, and that the ETL job is only invoked if the REST API call job succeeds.

Example - Daily Football Matches Feed

To demonstrate this approach in action, we will create a daily process for ingesting football match data from a REST API into a MySQL relational database. To work through this example, I will assume the following are already in place.


Prerequisites

1) Football REST API Key for the brilliant football-data.org website
2) AWS Account - a bit obvious
3) AWS Glue Service Role - this can be created following these instructions


4) S3 Bucket - Make sure the Glue Service role has full Read/Write access to this bucket

5) Relational Database with a table in it called matches with this schema (in MySQL DDL)

create table matches (
    id int,
    match_date date,
    home_team varchar(200),
    away_team varchar(200),
    home_score int,
    away_score int
);

6) A Glue Catalogue table for matches pre-crawled from that relational database

7) VPC infrastructure


Glue Python Shell Job

From AWS Glue Studio, go into Jobs, select the "Python Shell script editor" option and click "Create" to create a new job. The default option for a new script with boilerplate will suffice. This is the first kind of Glue Job we will create - simply a serverless Python shell that doesn't use Spark.




Under Job Details, we need to select the AWS Glue Service role from the IAM Role drop-down. All other settings can be left at their defaults.





You can copy and paste the following code into the "Script" section, replacing "your_bucket_name" and "your_api_key" as appropriate.




This script doesn't demonstrate any code craftsmanship - it's just a minimal set of lines to make an API call and store the result in an S3 bucket for demonstration purposes!

import datetime
import json

import boto3
import requests

uri = 'https://api.football-data.org/v4/matches'
bucket_name = 'your_bucket_name'
headers = {'X-Auth-Token': 'your_api_key'}

# Fetch today's matches from the REST API
response = requests.get(uri, headers=headers)
todays_matches = response.json()

# Write the raw JSON to S3, timestamped so each daily run gets its own file
s3 = boto3.resource('s3')
timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')
s3_object = s3.Object(bucket_name, f"football-in/matches-{timestamp}.json")
s3_object.put(Body=json.dumps(todays_matches).encode('utf-8'))

You can make a few trial runs to make sure it works - if it does, you should end up with today's matches in a JSON file in the bucket, with records that look something like this:

{
  "filters": {
    "dateFrom": "2022-10-26",
    "dateTo": "2022-10-27",
    "permission": "TIER_ONE"
  },
  "resultSet": {
    "count": 11,
    "competitions": "BSA,CL",
    "first": "2022-10-26",
    "last": "2022-10-26",
    "played": 2
  },
  "matches": [
    {
      "area": {

If you've gotten this far, then great! The next section involves creating a different type of Glue job - the Spark ETL kind - for shifting this file from S3 into a relational database.

Glue ETL Job

We go into the AWS Glue Studio Jobs section once again, but now we select "Visual with a blank canvas" - this time we are electing to create a Spark-based ETL job from Glue Studio.



When presented with a blank canvas, the first node to create in the data pipeline is an Amazon S3 Data Source, via "Source->Amazon S3". After this has been added to the canvas, go to the Data Source Properties, select "S3 location" as the S3 source type, then enter the fully qualified S3 URL for the bucket location that was configured in the earlier Python Shell job. Clicking "Infer schema" will set the Output schema based on any JSON files already present in that S3 location.


In the next step, we will need to perform a Custom Transformation to "explode" the current results format, where a single JSON object includes a nested array of JSON objects. This type of structure is typical when making REST API calls to an endpoint that lets us fetch large batches of records from a given time window - e.g. the list of daily matches in our example. So if you want to follow the design in this article, you're likely to need a similar operation, too.
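Conceptually, the explode just turns one wrapper object into one record per element of its nested array. A plain-Python illustration of the same reshaping (no Spark needed; the function name is illustrative):

```python
def explode_matches(payload):
    # Each element of the nested "matches" array becomes its own record,
    # mirroring what Spark's explode() does inside the Glue transform below.
    return [{"matches_exp": m} for m in payload.get("matches", [])]


payload = {"resultSet": {"count": 2}, "matches": [{"id": 1}, {"id": 2}]}
records = explode_matches(payload)
# records -> [{"matches_exp": {"id": 1}}, {"matches_exp": {"id": 2}}]
```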

After selecting a "Custom Transform" from the "Action" menu, the following code can be pasted into the Code Block - this takes the contents of the matches array and creates a separate record for each match in it.

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    from pyspark.sql.functions import explode
    # Take the single incoming DynamicFrame from the collection
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Explode the nested "matches" array into one row per match
    df = df.select(explode(df["matches"]).alias("matches_exp"))
    dyf_exploded = DynamicFrame.fromDF(df, glueContext, "exploded_results")
    return DynamicFrameCollection({"CustomTransform0": dyf_exploded}, glueContext)


    

To progress further in a meaningful way, we now need to infer the dynamic schema of the new match attributes this process will introduce. To do this, select the Data Preview pane on the Custom Transform, click "Start data preview session", pick the IAM Glue Service role for this task, and set it off. It will take about 1-2 minutes (depending on how many matches took place that day!). After this is done, you can browse the data in the Data Preview tab, but for our purposes, go to the "Output schema" tab and click "Use data preview schema" to propagate the transformed schema:




After the Custom Transformation, we need to use a "Select From Collection" operation to pass the exploded data frame on to the next operation. This operation seems a little pointless, but it's necessary because a Custom Transformation can create multiple "outputs" - Glue won't automatically default to the first one even when there is only one output, as in our case. Select "Action->Select From Collection", go to the Transform tab, and choose the only frame index available in the drop-down, "0".

The final two steps are, frankly, clumsy, owing to a deficiency in Glue Studio. They are to 1) map object fields onto flat table columns, and 2) write the flat table to a relational database.

First, select "Action->Change Schema", but don't do anything with the Transform initially. Then select the "Target->AWS Glue Data Catalog" node. On the Data Target properties tab, select the appropriate database and matches table from the Glue Catalogue entries mentioned in the prerequisites.


Now you can revisit the Change Schema operation, and the UI will be in a form where you can map the object structure onto a tabular structure. Map as follows:




        matches_exp.score.fulltime.home    ->  home_score
        matches_exp.score.fulltime.away    ->  away_score
        matches_exp.id                     ->  id
        matches_exp.hometeam.tla           ->  home_team
        matches_exp.utcdate                ->  match_date
        matches_exp.awayteam.tla           ->  away_team
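If you later move this pipeline out of the visual editor and into a Glue script, the same Change Schema step can be expressed with Glue's ApplyMapping transform. A sketch of the mapping list (the source and target types are my assumptions; the actual call requires the awsglue library, so it is shown only as a comment):

```python
# Source-field -> target-column pairs from the table above, in the
# (source, source_type, target, target_type) format ApplyMapping expects.
mappings = [
    ("matches_exp.id", "int", "id", "int"),
    ("matches_exp.utcdate", "string", "match_date", "date"),
    ("matches_exp.hometeam.tla", "string", "home_team", "string"),
    ("matches_exp.awayteam.tla", "string", "away_team", "string"),
    ("matches_exp.score.fulltime.home", "int", "home_score", "int"),
    ("matches_exp.score.fulltime.away", "int", "away_score", "int"),
]

# Inside a Glue script this would be applied as (requires awsglue):
# from awsglue.transforms import ApplyMapping
# mapped = ApplyMapping.apply(frame=dyf_exploded, mappings=mappings)
```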


The Glue ETL job is now complete. All that remains is to orchestrate the behaviour of the two Glue jobs that have been created.

Glue Workflow

The final part of the system is to use a Glue Workflow for coordinating the behaviour of the two jobs. From the Glue homepage, select  Workflows, and then select "Add Workflow" from the workflows page.  


After this is done, an empty workflow will appear on the Workflows page. Select that workflow from the list, and click "Add Trigger" to open the "Add trigger" panel. You can reuse previous triggers as templates or go to the "Add New" tab to create a new trigger as shown here.


After "Add" is clicked, a Trigger node will be added to the empty workflow. Click "Add Node" to select the job that should be run on a schedule - this will be the Glue Shell job that calls the REST API. We should end up with a two node workflow like this once done:




Next, we want a successful API call to be followed by an invocation of the Glue ETL job, which ingests the resultant JSON into the relational database. Select the Football REST API Call node to see the option for creating another Trigger. By default, this new trigger will be conditional on a successful REST API job execution, but the same mechanism can also be used to trigger on failed jobs.



Once the trigger has been created, we can finally add another node via "Add Node" to run the Glue ETL job for ingestion. From the list of jobs that comes up, select the Glue Spark ETL job from the previous section. The final result should look like this:
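The same pair of triggers can also be defined programmatically with boto3, which is handy once you move beyond the console. A sketch of the two trigger definitions (the workflow, trigger, and job names plus the cron schedule are illustrative; the dicts are shown without actually calling the API):

```python
# Scheduled trigger: runs the Python Shell job daily at a set time.
scheduled_trigger = {
    "Name": "daily-api-call",
    "WorkflowName": "football-ingest",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 6 * * ? *)",  # 06:00 UTC daily
    "Actions": [{"JobName": "football-rest-api-call"}],
    "StartOnCreation": True,
}

# Conditional trigger: runs the ETL job only if the API call succeeded.
conditional_trigger = {
    "Name": "on-api-success",
    "WorkflowName": "football-ingest",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "football-rest-api-call",
            "State": "SUCCEEDED",
        }]
    },
    "Actions": [{"JobName": "football-etl-load"}],
    "StartOnCreation": True,
}

# Each dict would then be passed to the Glue API, e.g.:
# import boto3
# glue = boto3.client("glue")
# glue.create_trigger(**scheduled_trigger)
# glue.create_trigger(**conditional_trigger)
```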



And there we have it - a complete Glue ETL pipeline that will ingest a batch of daily football match scores from football-data.org into a relational database.

Conclusion

I have described a reusable pattern for ingesting data from REST APIs into a relational database. The proposed design comes with a low operating cost thanks to the serverless technology. Additionally, having the entire process defined in Python allows a single team to build an end-to-end solution without needing to learn languages that data engineers may not be experts in (like Node).

On the other hand, it would be nice if the Glue ETL engine provided some ready-made connectors for REST APIs, without having to resort to Python Shell jobs. However, the Glue Workflow framework does demonstrate that these diverse jobs are easily orchestrated and integrated into a common scheduling and monitoring platform.

Now, while I've worked through this example using the visual editing tools, in practice this ought to be done in CloudFormation or CDK when productionising. For the sake of brevity, I've avoided going into detail about how this can be done, but I can provide some CDK code samples if needed.

One of the challenges you will probably encounter when trying to automate the deployment of this solution is the lack of maturity in the CDK libraries for Glue. Consequently, most of the infrastructure-as-code that I put together around this example had to revert to the less user-friendly CloudFormation constructs. Over time, as Glue uptake increases, I would anticipate better support for CDK.
