4 votes

I have an AWS Glue job written in Python that pulls in the spark-xml library (through the Dependent jars path). I'm using spark-xml_2.11-0.2.0.jar. When I try to output my DataFrame to XML I get an error. The code I'm using is:

applymapping1.toDF().repartition(1).write.format("com.databricks.spark.xml").save("s3://glue.xml.output/Test.xml")

The error I get is:

"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o75.save. : java.lang.AbstractMethodError: com.databricks.spark.xml.DefaultSource15.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation; at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at

If I change it to CSV, it works fine:

applymapping1.toDF().repartition(1).write.format("com.databricks.spark.csv").save("s3://glue.xml.output/Test.xml")

Note: when writing CSV I don't have to add any dependent jar. I think spark-csv is already included in AWS Glue's Spark environment.
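For what it's worth, Spark 2.x ships a built-in CSV data source, so the short format name also works with no extra jar. A minimal sketch (the S3 output path is illustrative):

# Spark 2.x includes a CSV data source out of the box, so the short
# format name works without adding spark-csv as a dependent jar.
# The output path below is illustrative.
applymapping1.toDF().repartition(1).write.format("csv").save("s3://glue.xml.output/Test.csv")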

Any suggestions on what to try?

I've tried various versions of spark-xml:

spark-xml_2.11-0.2.0
spark-xml_2.11-0.3.1
spark-xml_2.10-0.2.0


1 Answer

0 votes

This question is very similar to (but not an exact duplicate of) Why does elasticsearch-spark 5.5.0 give AbstractMethodError when submitting to YARN cluster?, which also deals with an AbstractMethodError.

Quoting the javadoc of java.lang.AbstractMethodError:

Thrown when an application tries to call an abstract method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of some class has incompatibly changed since the currently executing method was last compiled.

That pretty much explains what you're experiencing (note the part that starts with "this error can only occur at run time").

I think there's a Spark version mismatch at play here.

Given com.databricks.spark.xml.DefaultSource15 in the stack trace, and the spark-xml change that says:

Remove the separated DefaultSource15 due to compatibility in Spark 1.5+

This removes DefaultSource15 and merges it into DefaultSource. It had been kept separate for compatibility with Spark 1.5+. In master and spark-xml 0.4.x, Spark 1.x support has been dropped.

You should make sure that the Spark version in AWS Glue's Spark environment matches the Spark version your spark-xml build was compiled against. The latest version of spark-xml, 0.4.1, was released on 6 Nov 2016.
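If you're not sure which Spark and Scala versions Glue runs, you can print them from the job itself and then pick a spark-xml artifact built for those versions (the Scala version is encoded in the artifact name, e.g. spark-xml_2.11). A minimal sketch, assuming a standard Glue PySpark job:

# Minimal sketch: print the Spark and Scala versions of the running
# environment so you can choose a matching spark-xml artifact.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("Spark version:", spark.version)
# Reaches into the JVM via py4j; prints e.g. "version 2.11.8"
print("Scala version:", sc._jvm.scala.util.Properties.versionString())

With a matching build (e.g. a Scala 2.11 artifact from the spark-xml 0.4.x line for Spark 2.x), the com.databricks.spark.xml format should resolve without the AbstractMethodError.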