I faced the same issue with the following version of Spark and Python:
Spark - 2.4.0
Python - 2.7.5
None of the above solutions worked for me.
For me, the issue occurred while trying to save the result RDD to an HDFS location. I was reading the input from an HDFS location and saving the result back to HDFS. The following was the code used for the read and write operations when this issue came up:
Reading input data:
monthly_input = sc.textFile(monthly_input_location).map(lambda i: i.split("\x01"))
monthly_input_df = sqlContext.createDataFrame(monthly_input, monthly_input_schema)
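(monthly_input_schema above is a StructType I had defined elsewhere. As a rough, purely hypothetical sketch of its shape, with made-up column names:)

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical example only -- the real schema has our actual column names and types
monthly_input_schema = StructType([
    StructField("account_id", StringType(), True),
    StructField("amount", StringType(), True)
])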
Writing to HDFS:
result = output_df.rdd.map(tuple).map(lambda line: "\x01".join([str(i) for i in line]))
result.saveAsTextFile(output_location)
I changed the reading and writing code to the following:
Reading code:
monthly_input = (sqlContext.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", "true")
    .option("delimiter", "\x01")
    .schema(monthly_input_schema)
    .load(monthly_input_location))
Writing Code:
(output_df.write.format("csv")
    .option("header", "false")
    .option("delimiter", "\x01")
    .save(output_location))
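One thing worth checking after the write is what the raw output lines actually look like. A quick sketch for inspecting them (not part of the original job; it reuses the same sc and output_location as above):

# Print a few raw output lines, e.g. to see the "" quoting described below
for line in sc.textFile(output_location).take(5):
    print(line)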
Not only did this solve the issue, it also improved the I/O performance by a great deal (almost 3x).
But there is one known issue with the write logic above for which I have yet to figure out a proper solution: if there is a blank field in the output, then due to the CSV encoding it will show the blank value enclosed in double quotes ("").
For me that issue is currently not a big deal, since I am loading the output into Hive anyway, and the double quotes can be removed during the import itself.
PS: I am still using SQLContext and have yet to upgrade to SparkSession. But from what I have tried so far, the equivalent read and write operations in SparkSession-based code work the same way.
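For anyone already on SparkSession, a minimal sketch of what I expect the equivalent code to look like (same options and variables as above; I have only lightly tested this pattern):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading, equivalent to the SQLContext version above
monthly_input_df = (spark.read.format("csv")
    .option("encoding", "UTF-8")
    .option("header", "true")
    .option("delimiter", "\x01")
    .schema(monthly_input_schema)
    .load(monthly_input_location))

# Writing, equivalent to the SQLContext version above
(output_df.write.format("csv")
    .option("header", "false")
    .option("delimiter", "\x01")
    .save(output_location))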