
How do I import data from an Oracle database into a Spark DataFrame or RDD, and then write that data to a Hive table?

I have the following code:

package replicator;

import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ImportFromOracleToHive {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf()
                .setAppName("Data transfer test (Oracle -> Hive)")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // JDBC options for the Oracle source table.
        HashMap<String, String> options = new HashMap<>();
        options.put("url", "jdbc:oracle:thin:@<ip>:<port>:orcl");
        options.put("dbtable", "ACCOUNTS");
        options.put("user", "username");
        options.put("password", "12345");
        options.put("driver", "oracle.jdbc.OracleDriver");
        options.put("numPartitions", "4");

        // Load the Oracle table into a DataFrame (Spark 1.x API).
        DataFrame oracleDataFrame = sqlContext.read()
                .format("jdbc")
                .options(options)
                .load();

        sc.close();
    }
}

If I create an instance of HiveContext in order to use Hive

HiveContext hiveContext = new HiveContext(sc);

I get the following error:

ERROR conf.Configuration: Failed to set setXIncludeAware(true) for parser oracle.xml.jaxp.JXDocumentBuilderFactory@51be472e:java.lang.UnsupportedOperationException: setXIncludeAware is not supported on this JAXP implementation or earlier: class oracle.xml.jaxp.JXDocumentBuilderFactory
java.lang.UnsupportedOperationException: setXIncludeAware is not supported on this JAXP implementation or earlier: class oracle.xml.jaxp.JXDocumentBuilderFactory
        at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:614)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2534)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1144)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1116)
        at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:525)
        at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:543)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:437)
        at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:2750)
        at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:2713)
        at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:185)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
        at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:329)
        at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
        at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:443)
        at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
        at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:271)
        at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
        at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
        at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:103)
        at replicator.ImportFromOracleToHive.init(ImportFromOracleToHive.java:52)
        at replicator.ImportFromOracleToHive.main(ImportFromOracleToHive.java:76)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

1 Answer


The issue appears to be an outdated Xerces dependency, as detailed in this question. My guess is that you've pulled it in transitively, but it's impossible to tell without seeing your pom.xml. Notice from the stack trace you posted that the error originates in the hadoop-common Configuration class, not in Spark itself. The fix is to make sure a recent enough Xerces version is on the classpath:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
</dependency>
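
You can also run mvn dependency:tree to see which artifact drags in Oracle's XML parser (the trace shows oracle.xml.jaxp.JXDocumentBuilderFactory being picked as the JAXP implementation); if it arrives via the Oracle JDBC driver, an <exclusion> on that dependency is another option. The exact coordinates of the offending jar vary, so check the tree output rather than guessing.

Once the parser conflict is resolved, the write side of your question is straightforward with a HiveContext. Here is a minimal sketch, assuming Spark 1.x; the target table name accounts_hive is illustrative, not something from your code:

// Read from Oracle through the HiveContext so the resulting
// DataFrame can be written straight into the Hive metastore.
// Requires org.apache.spark.sql.SaveMode and
// org.apache.spark.sql.hive.HiveContext on top of your existing imports.
HiveContext hiveContext = new HiveContext(sc);
DataFrame oracleDataFrame = hiveContext.read()
        .format("jdbc")
        .options(options)   // the same JDBC options map from your code
        .load();

// saveAsTable creates a managed Hive table from the DataFrame.
oracleDataFrame.write()
        .mode(SaveMode.Overwrite)       // replace the table if it already exists
        .saveAsTable("accounts_hive");  // illustrative table name

Use SaveMode.Append instead if the table should accumulate rows across runs rather than being rebuilt each time.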