I have an AWS Glue job written in Python that pulls in the spark-xml library (through the Dependent jars path). I'm using spark-xml_2.11-0.2.0.jar. When I try to output my DataFrame to XML I get an error. The code I'm using is:
applymapping1.toDF().repartition(1).write.format("com.databricks.xml").save("s3://glue.xml.output/Test.xml");
The error I get is:
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o75.save. : java.lang.AbstractMethodError: com.databricks.spark.xml.DefaultSource15.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation; at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at
If I change it to CSV, it works fine:
applymapping1.toDF().repartition(1).write.format("com.databricks.csv").save("s3://glue.xml.output/Test.xml");
Note: When using CSV I don't have to import spark-xml. I think spark-csv is included in AWS Glue's Spark environment.
Any suggestions to what to try?
I've tried various versions of spark-xml:
spark-xml_2.11-0.2.0 spark-xml_2.11-0.3.1 spark-xml_2.10-0.2.0