0
votes

I am having a hive partitioned table txnaggr_rt_fact and it has 2 column partitions txninterval and intervaltype. I am trying to insert a record to this table from spark sql using java. During insert the throws an exception. If the partition doesn't exist as it has to create a new one and fails doing so but If the partition is already existing the record is getting inserted.

Java code is as follows

SparkSession spark = SparkSession
          .builder()
          .appName("Java Spark Hive Example")
          .config("spark.sql.warehouse.dir", "hdfs://localhost:8020/user/hive/warehouse")
          .config("hive.exec.dynamic.partition", "true")
          .config("hive.exec.dynamic.partition.mode", "nonstrict")
          .enableHiveSupport()
          .getOrCreate();

        spark.sql("show databases").show();
        spark.sql("use nadb");
       /* spark.sql("hive.exec.dynamic.partition = true");
        spark.sql("set hive.exec.dynamic.partition.mode=nonstrict");*/
        spark.sql("insert into table txnaggr_rt_fact partition (txninterval = '2017-01-11', intervaltype='Test') values('"+21+"',null,'"+22+"',"+23+")");

The exception when I am inserting to hive partitioned table is as follows

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.NullPointerException: null;
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
        at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:843)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:249)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
        at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
        at com.cw.na.spark.HiveSqlTest.main(HiveSqlTest.java:76)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3412)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1650)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1579)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:836)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply$mcV$sp(HiveClientImpl.scala:741)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:739)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:739)
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
        at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
        at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:739)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply$mcV$sp(HiveExternalCatalog.scala:855)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:843)
        at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:843)
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
        ... 25 more

I found that in my hive-site.xml which is present in spark/conf directory the hive.exec.dynamic.partition.mode was set to strict. I have changed it to nonstrict and even in code I have put the configurations for hive.exec.dynamic.partition.mode and hive.exec.dynamic.partition even then I am getting the same exception.

hive-site.xml

  <property>
    <name>hive.exec.dynamic.partition</name>
    <value>true</value>
    <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
    <description>
      In strict mode, the user must specify at least one static partition
      in case the user accidentally overwrites all partitions.
      In nonstrict mode all partitions are allowed to be dynamic.
    </description>
  </property>
2

2 Answers

0
votes

What if you instead

 spark.sql("insert into table txnaggr_rt_fact partition (txninterval = '2017-01-11', intervaltype='Test') values('"+21+"',null,'"+22+"',"+23+")");

try

 spark.sql("insert overwrite table txnaggr_rt_fact partition (txninterval, intervaltype) values('"+21+"',null,'"+22+"',"+23+",'2017-01-11','Test')");

Here is how I do insert:

sql("create table if not exists table_test (a int, b int) partitioned by (ds string)")
sql("set hive.exec.dynamic.partition.mode=nonstrict")
sql("insert overwrite table_test partition(ds) select a, b, '2017-01-01' from other_table") 
0
votes

first of all you have to create special table because it would be transactional, so by default its off. example:

CREATE TABLE table_name (
  id                int,
  name              string
)
CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES ("transactional"="true",
  "compactor.mapreduce.map.memory.mb"="2048",     -- specify compaction map job properties
  "compactorthreshold.hive.compactor.delta.num.threshold"="4",  -- trigger minor compaction if there are more than 4 delta directories
  "compactorthreshold.hive.compactor.delta.pct.threshold"="0.5" -- trigger major compaction if the ratio of size of delta files to
                                                               -- size of base files is greater than 50%
);

then execute some configs like this https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions an huge issue is that you wouldnt inserto into external table that it has already created.