I'm using a UDF written in Python to convert a number from one base to another.
I read a Parquet file, apply the UDF, and write the result to another Parquet file. Here is the line I run:
input_df.withColumn("origin_base", convert_2_dest_base(input_df.origin_base)).write.mode('overwrite').parquet(destination_path)
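For context, convert_2_dest_base is a plain Python UDF roughly of this shape (a minimal sketch with an assumed base-36 conversion; the exact conversion logic isn't the point):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base_36(value):
    # Convert a base-10 integer (given as a string or int) to a base-36 string.
    n = int(value)
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))

convert_2_dest_base = udf(to_base_36, StringType())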
This conversion makes Spark use a lot of memory, and I get warnings like this:
17/06/18 08:05:39 WARN TaskSetManager: Lost task 40.0 in stage 4.0 (TID 183, ip-10-100-5-196.ec2.internal, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 4.4 GB of 4.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
and in the end it fails.
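The message suggests boosting spark.yarn.executor.memoryOverhead, which I understand I could set when building the session, something like this (the value is just a placeholder):

from pyspark.sql import SparkSession

# Example only: raise the off-heap overhead (in MB) that YARN allows per executor.
spark = SparkSession.builder \
    .config("spark.yarn.executor.memoryOverhead", "2048") \
    .getOrCreate()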
Is a UDF not the right approach here? Why is it consuming so much memory?
