
I'm currently figuring out how to properly save a specific Hive table that was derived from a mapped source table in a particular database. Suppose there is a separate database for the tester side and for the developer side. How can I segregate the lists of tables that each side can access?

For now, I monitor the state of the two databases via Hue. I have a Spark program that runs on a YARN cluster and creates a table whose storage location depends on whether the user is a developer or a tester.

The Spark program I've just created is a simple app that reads a table from the current warehouse location and saves a new table named new_table.
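
For reference, the app amounts to something like the following sketch (a minimal reconstruction, assuming Spark 2.x; source_table is a placeholder for the actual mapped source table):

import org.apache.spark.sql.SparkSession

// Minimal sketch of the app described above; table names are placeholders.
object NewTableApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NewTableApp")
      .enableHiveSupport() // talk to the metastore configured in hive-site.xml
      .getOrCreate()

    // Read the source table from the current warehouse location
    // and persist the result as a new managed table.
    spark.table("source_table").write.saveAsTable("new_table")

    spark.stop()
  }
}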

My hive-site.xml configuration is as follows:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://xxxx:9083</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>300</value>
  </property>
  <!--<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/yyyy/warehouse</value>
  </property>-->
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/var/log/hive/operation_logs</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>67108864</value>
  </property>
  <property>
    <name>hive.exec.copyfile.maxsize</name>
    <value>33554432</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.checkinterval</name>
    <value>4096</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.flush.percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.vectorized.execution.reduce.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.mapredfiles</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.cbo.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion.threshold</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.limit.pushdown.memory.usage</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16777216</value>
  </property>
  <property>
    <name>hive.merge.size.per.task</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication.min.reducer</name>
    <value>4</value>
  </property>
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.map.aggr.hash.percentmemory</name>
    <value>0.5</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>996461772</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>966367641</value>
  </property>
  <property>
    <name>spark.executor.cores</name>
    <value>4</value>
  </property>
  <property>
    <name>spark.yarn.driver.memoryOverhead</name>
    <value>102</value>
  </property>
  <property>
    <name>spark.yarn.executor.memoryOverhead</name>
    <value>167</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.initialExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.minExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.maxExecutors</name>
    <value>2147483647</value>
  </property>
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>xxxx,xxxx</value>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>xxxx,xxxx</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hive.zookeeper.namespace</name>
    <value>hive_zookeeper_namespace_hive</value>
  </property>
  <property>
    <name>hive.cluster.delegation.token.store.class</name>
    <value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.use.SSL</name>
    <value>false</value>
  </property>
  <property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
  </property>
</configuration>

Based on my current understanding, if I change the warehouse location when submitting the Spark app to the YARN cluster, by setting hive.metastore.warehouse.dir to a value such as hdfs:/user/diff/warehouse in a hive-site.xml shipped with --files /file/hive-site.xml, the Hive configuration in the Spark app should detect the Hive tables that exist in that directory.
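
In code, what I'm attempting corresponds to something like this (a sketch, assuming Spark 2.x, where spark.sql.warehouse.dir supersedes hive.metastore.warehouse.dir):

import org.apache.spark.sql.SparkSession

// Sketch of the same warehouse override done programmatically (Spark 2.x).
// In 2.x, spark.sql.warehouse.dir supersedes hive.metastore.warehouse.dir.
val spark = SparkSession.builder()
  .appName("WarehouseOverride")
  .config("spark.sql.warehouse.dir", "hdfs:/user/diff/warehouse")
  .enableHiveSupport()
  .getOrCreate()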

However, upon doing so, new tables still persist to the location of the default database behind hive.metastore.uris, which points to hdfs:/user/hive/warehouse. My understanding is that the remote metastore referenced by hive.metastore.uris overrides the database location set in hive.metastore.warehouse.dir.

What am I doing wrong at this point? Is there something I need to configure properly in hive-site.xml? Any answers would be appreciated. Thank you; I'm a novice developer when it comes to Spark and Hadoop.


1 Answer


Create separate databases

Demo

Creating the databases is a one-time task:

hive> create database dev_db location '/user/hive/my_databases/dev';
hive> create database tst_db location '/user/hive/my_databases/tst';

When you create a table, you choose the database you want to work with:

hive> create table dev_db.my_dev_table (i int);
hive> create table tst_db.my_tst_table (i int);

hive> desc formatted dev_db.my_dev_table;

# col_name              data_type               comment             
         
i                       int                                         
         
# Detailed Table Information         
Database:               dev_db                   
...                  
Location:               hdfs://quickstart.cloudera:8020/user/hive/my_databases/dev/my_dev_table  
...      

hive> desc formatted tst_db.my_tst_table;

Database:               tst_db                   
...              
Location:               hdfs://quickstart.cloudera:8020/user/hive/my_databases/tst/my_tst_table  
...
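
From the Spark side you can then target whichever database matches the role instead of overriding the warehouse directory. A minimal sketch, assuming the role is passed as a program argument (how the role is actually determined is up to you, and source_table is a placeholder):

import org.apache.spark.sql.SparkSession

// Sketch: write to the role-specific database rather than moving the warehouse.
// Passing "dev" or "tst" as the first argument is an assumption for illustration.
object RoleAwareWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RoleAwareWriter")
      .enableHiveSupport()
      .getOrCreate()

    val db = if (args.headOption.contains("dev")) "dev_db" else "tst_db"

    // The new table lands under the chosen database's location,
    // e.g. /user/hive/my_databases/dev/new_table for dev_db.
    spark.table("source_table").write.saveAsTable(s"$db.new_table")

    spark.stop()
  }
}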