
I'm using EMR Notebooks with pyspark and livy.

I'm reading data from S3, stored in Parquet format, into a PySpark DataFrame. There are approximately 2 million rows. When I run a join operation, I get a 400 "Session isn't active" error, even though I have already set the Livy timeout to 5 hours.

An error was encountered: Invalid status code '400' from https://172.31.12.103:18888/sessions/5/statements/20 with error payload: "requirement failed: Session isn't active."


2 Answers


I had the same issue, and the reason for the timeout was the driver running out of memory. By default, the driver memory is 1000M when you create a Spark application through EMR Notebooks, even if you set a higher value through config.json. You can verify this by executing the following code from within a Jupyter notebook:

spark.sparkContext.getConf().get('spark.driver.memory')
1000M

To increase the driver memory, run:

%%configure -f 
{"driverMemory": "6000M"}

This will restart the application with the increased driver memory. You might need a higher value for your data. Hope it helps.
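If the join also puts pressure on the executors, the same `%%configure` cell can raise executor resources at the same time. `driverMemory`, `executorMemory`, `executorCores`, and `numExecutors` are standard Livy session fields; the values below are only illustrative and should be sized to your cluster:

```
%%configure -f
{
    "driverMemory": "6000M",
    "executorMemory": "4000M",
    "executorCores": 2,
    "numExecutors": 4
}
```

Note that `-f` forces the current session to be dropped and recreated, so any variables defined in the notebook so far will be lost.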


You can try running your operation on a small amount of data first. Once it works end to end as expected, you can move to the full dataset.
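A minimal sketch of that approach in PySpark, run inside the EMR notebook session. The bucket paths and the join column `id` are hypothetical placeholders for your own data:

```python
# Hypothetical S3 paths and join key -- substitute your own.
# Read only a slice of each table so the join fits comfortably in memory.
small_a = spark.read.parquet("s3://my-bucket/table_a/").limit(10000)
small_b = spark.read.parquet("s3://my-bucket/table_b/").limit(10000)

joined = small_a.join(small_b, on="id", how="inner")

# An action forces execution; if this succeeds, scale the limit up
# gradually while watching driver/executor memory in the Spark UI.
joined.count()
```

If the sampled join succeeds but the full join still kills the session, that points at resource sizing (driver/executor memory) rather than a logic error in the join itself.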