I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!
3 Answers
Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use the requests Python library.
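As a rough sketch of that extraction step, assuming a hypothetical paginated endpoint whose JSON responses carry `items` and `next_page` keys and accept a `page` query parameter (all three names are assumptions, so adapt them to the real API):

```python
from typing import Callable, Iterator, Optional


def fetch_page_with_requests(url: str, page: int) -> dict:
    """Fetch one page of JSON; the `page` query parameter is an assumed convention."""
    import requests  # third-party; imported lazily so the paging logic stays dependency-free

    response = requests.get(url, params={"page": page}, timeout=30)
    response.raise_for_status()  # fail the job early on HTTP errors
    return response.json()


def iter_items(fetch_page: Callable[[int], dict], first_page: int = 1) -> Iterator[dict]:
    """Yield items page by page until the API reports no next page."""
    page: Optional[int] = first_page
    while page is not None:
        payload = fetch_page(page)
        yield from payload.get("items", [])
        page = payload.get("next_page")  # assumed key; None or missing ends the loop


# Example call (the URL is a placeholder):
# items = list(iter_items(lambda p: fetch_page_with_requests("https://api.example.com/items", p)))
```

Passing the page fetcher in as a callable keeps the paging loop independent of the HTTP library, which makes it easy to try out locally before deploying it as a Glue job.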
In order to save the data into S3 you can do something like this
import boto3
import json
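# Create an S3 resource handle
s3 = boto3.resource('s3')
tweets = []
# Code that extracts tweets from the API and appends them to `tweets` goes here
# Serialize the tweets and upload them as a single JSON object
tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)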
The AWS Glue Python Shell executor has a limit of 1 DPU max. If that's an issue, like in my case, a solution could be running the script in ECS as a task.
You can run about 150 requests/second using libraries like asyncio and aiohttp in Python (example 1, example 2).
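A minimal sketch of that pattern, assuming the endpoints return JSON; the helper names `fetch_json`, `gather_bounded` and `main`, and the concurrency cap of 50, are illustrative choices rather than anything from the linked examples:

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def fetch_json(session, url: str) -> dict:
    """Fetch one URL with an aiohttp.ClientSession and decode the JSON body."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()


async def gather_bounded(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 50,
) -> List[dict]:
    """Run fetch(url) for every URL with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))


async def main(urls: List[str]) -> List[dict]:
    import aiohttp  # third-party; imported lazily

    async with aiohttp.ClientSession() as session:
        return await gather_bounded(urls, lambda url: fetch_json(session, url))


# Example call (the URLs are placeholders):
# results = asyncio.run(main(["https://api.example.com/items/1", "https://api.example.com/items/2"]))
```

The semaphore caps how many requests are open at once, which is usually the first knob to turn when tuning throughput against a remote API.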
Then you can distribute your requests across multiple ECS tasks or Kubernetes pods using Ray. Here you can find a few examples of what Ray can do for you.
This also allows you to cater for APIs with rate limiting.
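Within a single worker, one simple way to stay under a rate limit is a client-side throttle that spaces calls out. The `Throttle` class below is a hypothetical helper, not part of Ray or any library mentioned here, and it assumes the API enforces a plain requests-per-second limit:

```python
import time
from typing import Callable


class Throttle:
    """Space calls at least 1 / rate_per_second seconds apart."""

    def __init__(
        self,
        rate_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / rate_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Schedule the following call one interval after this one starts
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay


# Example: call throttle.wait() before each request to stay near 5 requests/second
# throttle = Throttle(rate_per_second=5)
```

When the work is spread over several tasks or pods, each worker would need its own share of the overall limit, since this throttle only knows about calls made in its own process.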
Once you've gathered all the data you need, run it through AWS Glue.
Yes, it is possible. You can use AWS Glue to extract data from REST APIs. Although Glue has no built-in connector for reaching the public internet, you can set up a VPC with a public and a private subnet. In the private subnet, you can create an ENI that allows only outbound connections, which Glue uses to fetch data from the API. In the public subnet, you can install a NAT gateway.
Additionally, you might also need to set up a security group to limit inbound connections. Hope this answers your question.