Posts

Showing posts from 2022

Evolution of Data Engineering over the Last 10 Years

Over the past 10 years, the field of data engineering has evolved significantly. Some of the key ways in which it has changed include the following:

Increased focus on big data and real-time data processing: In recent years, there has been a growing emphasis on technologies and techniques that enable organizations to collect, store, and process large volumes of data in real time. This has led to the widespread adoption of technologies such as Hadoop, Spark, and NoSQL databases, which are designed to handle big data efficiently.

Advancements in machine learning and artificial intelligence: The increasing availability of large datasets and powerful computing resources has led to significant advancements in machine learning and artificial intelligence. This in turn has increased the demand for data engineers who can design and implement systems that process and analyze data using these technologies.

Increased emphasis on data governance and privacy: As organizations co...

AWS Glue to ingest a REST API into a Relational Database

Intro

AWS Glue provides a handy serverless platform for moving data around at scale, usually for the purpose of machine learning or other kinds of data analytics. Your data doesn't have to be "Big Data" to qualify - it can be anything really!

Key features of AWS Glue include:
Spark ETL Jobs
Python Shell Jobs
Glue Catalog
Glue Workflows
Interactive Data Visualisation

In this article, I'll be describing an architecture for ingesting an API into a relational store using a combination of Spark ETL and Python Shell jobs, orchestrated by a Glue Workflow.

Batch Ingestion for Analytics

In an ideal world, all data sets would be available immediately and streamed in for real-time consumption. The reality is that we usually have to compromise on periodic retrieval, as real-time feeds can be costly to subscribe to from external data providers, and the internal infrastructure needed to stream internal data can be costly to maintain. Periodic data retrieval can either come in the form of polling for new ...
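As a rough illustration of the periodic, batch-style retrieval described in this excerpt, here is a minimal sketch of one polling run that pulls pages of records from an API and upserts them into a relational table. It is not AWS Glue code and not the article's implementation: sqlite3 stands in for the relational database, and ingest_batch, fake_fetch_page, and the api_records table are hypothetical names chosen for the example. The fetch function is passed in so that a real HTTP client could be swapped in later.

```python
import sqlite3
from typing import Callable, Dict, List

Record = Dict[str, object]


def ingest_batch(
    conn: sqlite3.Connection,
    fetch_page: Callable[[int], List[Record]],
    table: str = "api_records",  # hypothetical table name
) -> int:
    """Poll the API page by page and upsert every record by primary key.

    Returns the number of records processed (not the number of distinct rows).
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    processed = 0
    page = 1
    while True:
        records = fetch_page(page)
        if not records:  # an empty page marks the end of this batch
            break
        # INSERT OR REPLACE keeps the run idempotent: re-fetching a record
        # overwrites the stored row instead of failing on the primary key.
        conn.executemany(
            f"INSERT OR REPLACE INTO {table} (id, payload) VALUES (:id, :payload)",
            records,
        )
        processed += len(records)
        page += 1
    conn.commit()
    return processed


def fake_fetch_page(page: int) -> List[Record]:
    """Stand-in for a real REST call; returns canned pages of records."""
    pages = {
        1: [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}],
        2: [{"id": 2, "payload": "b2"}],  # record 2 was updated upstream
    }
    return pages.get(page, [])


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    count = ingest_batch(connection, fake_fetch_page)
    print(count, connection.execute("SELECT * FROM api_records").fetchall())
```

Making the write an upsert means a scheduled job can safely re-run over an overlapping window, which is usually what you want when polling on a timer.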