Why is Hive creating tables in the local file system?

Question

I'm following the getting-started guides on the Apache sites for Hadoop and Hive. I have Hadoop configured to run in Pseudo-Distributed Operation. I can run HDFS operations, start beeline, create tables, insert data, and so on. The only problem is that I expect the databases to be stored at /user/hive/warehouse on HDFS, but instead they are created at the same path on the local file system.

Here are my versions and configs:

hadoop@precise64:/data/hadoop-2.8.2/logs$ hadoop version
Hadoop 2.8.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
Compiled by jdu on 2017-10-19T20:39Z
Compiled with protoc 2.5.0
From source with checksum dce55e5afe30c210816b39b631a53b1d
This command was run using /data/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar
hadoop@precise64:/data/hadoop-2.8.2/logs$ hive --version
Hive 2.3.2
Git git://stakiar-MBP.local/Users/stakiar/Desktop/scratch-space/apache-hive -r 857a9fd8ad725a53bd95c1b2d6612f9b1155f44d
Compiled by stakiar on Thu Nov 9 09:11:39 PST 2017
From source with checksum dc38920061a4eb32c4d15ebd5429ac8a
hadoop@precise64:/data/hadoop-2.8.2/logs$ cat $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
hadoop@precise64:/data/hadoop-2.8.2/logs$ cat $HADOOP_HOME/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
   <property>
      <name>hadoop.proxyuser.hive.groups</name>
      <value>*</value>
   </property>
   <property>
      <name>hadoop.proxyuser.hive.hosts</name>
      <value>*</value>
   </property>
   <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
   </property>
   <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
   </property>
</configuration>
hadoop@precise64:/data/hadoop-2.8.2/logs$ cat $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
hadoop@precise64:/data/apache-hive-2.3.2-bin/conf$ cat hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/hadoop/tmp/${hive.session.id}_resources</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/home/hadoop/tmp/operation_logs</value>
  </property>
</configuration>

Are you using an embedded Hive metastore? Add your hive-site.xml — OneCricketeer
Yes, based on: <property><name>javax.jdo.option.ConnectionDriverName</name><value>org.apache.derby.jdbc.EmbeddedDriver</value></property> The hive-site.xml was based on the template, which is large. I've uploaded it to: dropbox.com/s/ake7my6wtjemiqu/hive-site.xml?dl=0 — Paul Jackson
If you write your own hive-site.xml, every property you leave out falls back to its default. The important one is the metastore directory. — OneCricketeer
If you would like a fully functional Hadoop environment, I might suggest that you install and configure things via Apache Ambari or Cloudera Manager — OneCricketeer
I replaced the hive-site.xml with a file that sets some directory locations and configures the metastore to use (for now) the embedded Derby driver. This worked. Interestingly, it worked immediately, without having to restart hiveserver2 or beeline. I'll migrate to remote databases as well as consider Cloudera. Thanks. — Paul Jackson

OneCricketeer · Accepted Answer · 2017-11-27T00:49:18

It sounds like you haven't configured Hive yet.

By default, this is what you get:

Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default, this location is ./metastore_db (see conf/hive-default.xml).

And you're limited in the number of connections:

Using Derby in embedded mode allows at most one user at a time.
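
Because the default URL uses a relative databaseName, Derby creates metastore_db in whatever directory Hive is started from, so launching from a different directory gives you a fresh, empty metastore. A minimal sketch of pinning it to a fixed location in hive-site.xml might look like this; the path /home/hadoop/metastore_db is an assumption for illustration, not something taken from the question.

```xml
<configuration>
  <!-- Embedded Derby metastore at an absolute path.
       /home/hadoop/metastore_db is a hypothetical location; use any
       directory the Hive user can write to. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>
```

This only makes the metastore location predictable. The one-user-at-a-time limit of embedded Derby still applies.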

It's recommended to use a PostgreSQL, MySQL, or Oracle database for the metastore (a remote metastore database):

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase
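
As a rough sketch of what that could look like, the hive-site.xml below points the metastore at a MySQL database and states the warehouse directory explicitly. The host, port, database name, user, and password are placeholders I've made up, and the driver class assumes MySQL Connector/J 5.x is on Hive's classpath, so adjust all of them for your setup.

```xml
<configuration>
  <!-- Metastore backed by MySQL instead of embedded Derby.
       localhost:3306 and the database name "metastore" are placeholders. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <!-- Driver class for Connector/J 5.x (assumed); the JAR must be in $HIVE_HOME/lib. -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Placeholder credentials; replace with a real metastore account. -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
  <!-- Where managed tables are stored; this is also the default value. -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```

The metastore schema has to exist before Hive can use it. With Hive 2.x that is normally a one-time schematool -dbType mysql -initSchema. The warehouse path has no scheme, so it should resolve against the default file system from core-site.xml (hdfs://localhost:9000 in the question).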
