0 votes

How can I make my PySpark code run on AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

To run PySpark, you use EMR. To launch EMR, you have various options, including the AWS console, the AWS CLI, or a Lambda function. You don't have to use Lambda, but you could if it makes sense, e.g. if the EMR cluster launch is triggered by data arriving in an S3 bucket. - jarmod
Do you have any resources on this that I can refer to? - Sample_friend
Assuming that you use Python, the boto3 library is what you would use to launch an EMR cluster. The boto3 documentation explains this in more detail. - jarmod

2 Answers

1 vote

You need a transient cluster for this case; it will auto-terminate once your job completes or the timeout is reached, whichever occurs first.

You can refer to this link for how to initialise one.
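
As a minimal sketch, a transient cluster can be launched with boto3's `run_job_flow`, setting `KeepJobFlowAliveWhenNoSteps` to `False` so the cluster shuts down once the submitted steps finish. The region, release label, instance sizes, bucket name, and script path below are placeholders, and the default EMR roles are assumed to exist:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="transient-pyspark-job",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False = transient: the cluster auto-terminates once all steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-pyspark-script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # wraps spark-submit on EMR
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/job.py"],  # hypothetical S3 path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

Because the step is submitted at launch time and `KeepJobFlowAliveWhenNoSteps` is `False`, nothing else is needed for teardown: the cluster terminates itself when the step ends.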

0 votes

There are several ways to create an EMR cluster:

  1. Using boto3 / AWS CLI / Java SDK
  2. Using CloudFormation
  3. Using AWS Data Pipeline

Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

No. It isn't mandatory to use Lambda to create an auto-terminating cluster.

You just need to specify the --auto-terminate flag while creating a cluster using the CLI (boto3 and the Java SDK expose the equivalent setting, e.g. KeepJobFlowAliveWhenNoSteps=False in boto3). But in this case you need to submit the job along with the cluster config, as in the sketch below. Ref
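
A sketch of the CLI variant (the cluster name, release label, instance settings, and script location are placeholders; --use-default-roles assumes the default EMR roles exist):

```sh
aws emr create-cluster \
    --name "transient-pyspark-job" \
    --release-label emr-6.15.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --auto-terminate \
    --steps Type=Spark,Name=run-pyspark-script,ActionOnFailure=TERMINATE_CLUSTER,Args=[--deploy-mode,cluster,s3://my-bucket/scripts/job.py]
```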

Note:

It's not possible to create an auto-terminating cluster using CloudFormation. By design, CloudFormation assumes that the resources being created will be permanent to some extent.

If you really had to do it this way, you could make an AWS API call to delete the CloudFormation stack upon finishing your EMR tasks.

How can I make my PySpark code run on AWS EMR from AWS Lambda?

You can design your Lambda to submit the Spark job. You can find an example here.
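
As one possible sketch, a Lambda handler can submit a Spark step to an already-running cluster with boto3's `add_job_flow_steps`; the event fields and the script path here are hypothetical:

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Hypothetical event shape: the target cluster and the script to run.
    cluster_id = event["cluster_id"]
    script = event.get("script", "s3://my-bucket/scripts/job.py")  # placeholder

    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": "run-pyspark-script",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # wraps spark-submit on EMR
                "Args": ["spark-submit", "--deploy-mode", "cluster", script],
            },
        }],
    )
    return {"StepIds": response["StepIds"]}
```

For the auto-terminating variant, the same handler could instead call `run_job_flow` with `KeepJobFlowAliveWhenNoSteps=False`, as in the first answer.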

In my use case, I have one parameterised Lambda which invokes CloudFormation to create the cluster, submit the job, and terminate the cluster.
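
A sketch of that pattern, assuming a hypothetical CloudFormation template in S3 that defines the cluster and its step (the template URL, stack parameters, and event fields are all illustrative):

```python
import boto3

cfn = boto3.client("cloudformation")

def lambda_handler(event, context):
    # Create a stack defining the EMR cluster and its Spark step.
    # TemplateURL and parameter names are hypothetical.
    cfn.create_stack(
        StackName=event["stack_name"],
        TemplateURL="https://s3.amazonaws.com/my-bucket/emr-cluster.yaml",
        Parameters=[
            {"ParameterKey": "ScriptPath",
             "ParameterValue": event["script_path"]},
        ],
        Capabilities=["CAPABILITY_IAM"],  # EMR templates typically create IAM roles
    )
    # Once the EMR steps finish, tear the cluster down by deleting the stack,
    # e.g. from a second Lambda or a completion event:
    # cfn.delete_stack(StackName=event["stack_name"])
```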