1 vote

When I load XML files in Spark 2.2.0 like this:

var ac = spark.read.format("xml").option("rowTag", "App").load("/home/sid/Downloads/Files/*.xml")

it shows me this error:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:549)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:301)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 48 elided

Caused by: java.lang.ClassNotFoundException: xml.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:533)
  ... 53 more

2
Have you included/imported com.databricks.spark.xml._? See here: github.com/databricks/spark-xml – Shaido
Yeah, I am using spark-shell and I load the dependency when starting it with: bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1 – Sahil Rohila

2 Answers

4 votes

Here you have to use the Databricks spark-xml package to load the XML files. You can load the package with the command below when using spark-submit or spark-shell.

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1

Then you can load the files like this:

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "app")
  .load("/home/sid/Downloads/Files/*.xml")

For more information, see https://github.com/databricks/spark-xml.
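
If you are building a standalone application instead of using spark-shell, a minimal sketch along the same lines might look like the following. It assumes sbt with the spark-xml 0.4.1 artifact declared as a dependency; the object name ReadAppXml is just for illustration.

// build.sbt (sketch): libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"
import org.apache.spark.sql.SparkSession

object ReadAppXml {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadAppXml").getOrCreate()

    // Fully qualified data source name, so Spark can locate the provider class
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "App") // the tag that delimits one row, as in the question
      .load("/home/sid/Downloads/Files/*.xml")

    df.printSchema() // schema is inferred from the XML structure
    df.show(5)

    spark.stop()
  }
}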

0 votes

@sahil desai - when we have already added the xml dependency, why should we create a SparkContext? spark-shell already provides a SparkSession as spark. Wouldn't this be better?

val df = spark.read.format("xml")
  .option("rowTag", "app")
  .load("/home/sid/Downloads/Files/*.xml")
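
As a quick sanity check after the load above (a sketch; it assumes the spark-xml build passed via --packages matches your Scala version, e.g. spark-xml_2.11 on Scala 2.11, so that the short name "xml" resolves, and the view name "apps" is just for illustration):

df.printSchema()               // schema inferred from the XML structure
df.show(5, truncate = false)   // first few parsed rows

df.createOrReplaceTempView("apps")
spark.sql("SELECT * FROM apps").show()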