I am using EMR step functions to analyze data.
I want to store the row count of the analyzed DataFrame so that I can decide whether to save it as CSV or Parquet. I would prefer CSV, but if the output is too big I won't be able to download it and use it on my laptop.
I used the count() method and stored the result in an int variable, limit.
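As a rough sketch of the decision I have in mind (MAX_CSV_ROWS and choose_format are hypothetical names, and the cutoff value is only a placeholder, not something I have measured):

```python
# Hypothetical cutoff: above this many rows, CSV is assumed too large to download.
MAX_CSV_ROWS = 1_000_000


def choose_format(row_count, max_csv_rows=MAX_CSV_ROWS):
    """Return "csv" for small results and "parquet" for large ones."""
    return "csv" if row_count <= max_csv_rows else "parquet"
```

Here limit (the result of count()) would be passed as row_count, and the returned string could be handed to write.format(...).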
When I try the following code:
coalesce(1).write.format("text").option("header", "false").mode("overwrite").save("output.txt")
It says that:
int doesn't have any attribute called write
Is there a way to write an integer or a string to a file so that I can open it in my S3 bucket and inspect it after the EMR step has run?
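To make the goal concrete, here is a minimal sketch of one possible approach, assuming the int can be wrapped in a one-row DataFrame so that a DataFrameWriter becomes available. The helper name write_count is hypothetical, and spark is assumed to be an existing SparkSession.

```python
def write_count(spark, count, path):
    """Write a single integer as one line of text under `path`.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # The text data source accepts exactly one string column,
    # so the int is converted to str before building the DataFrame.
    df = spark.createDataFrame([(str(count),)], ["value"])
    # coalesce(1) keeps the output to a single part file.
    df.coalesce(1).write.format("text").mode("overwrite").save(path)
```

It would be called as write_count(spark, limit, path) with an S3 path of my choosing. Spark treats the path as a directory and writes a part-* file inside it, rather than a single file with that exact name.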
Update: I tried the DataFrame method suggested by @Shu, but I am getting the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 19396, ip-10-210-13-34.ec2.internal, executor 11): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
What could be the root cause of this?