1
votes

I am looking into using Hive on our Hadoop cluster so that we can then use Presto to run analytics on the data stored in Hadoop, but I am still confused about a few things:

  • Files are stored in Hadoop (some kind of file manager)
  • Hive needs tables to store data from Hadoop (data manager)
    • Do Hadoop and Hive store their data separately, or does Hive just use the files already in Hadoop (in terms of hard disk space and so on)? In other words, does Hive import data from Hadoop into its own tables and leave Hadoop alone, or how should I think about this?
  • Can Presto be used without Hive and just on Hadoop directly?

Thanks in advance for answering my questions :)


1 Answer

3
votes

First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call a data manager?

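Actually, Hive can use both: "regular" files that already sit in HDFS (external tables), or managed tables, which are once again "regular" files in HDFS, kept under Hive's warehouse directory. In both cases the table definition (schema, file format, location) is additional metadata stored in a separate datastore called the metastore. Declaring a table over existing files does not copy them, so the data is not stored twice.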
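To make the "Hive just points at files in HDFS" idea concrete, here is a minimal sketch that renders the HiveQL for an external table. The helper name `external_table_ddl`, the table name `web_logs`, its columns, and the path `/data/web_logs` are all made up for illustration, and the comma-delimited text format is an assumption about how the files look.

```python
def external_table_ddl(table, columns, location):
    """Render a HiveQL statement that declares an external table.

    The statement only registers metadata in the metastore; the files
    under `location` stay where they are in HDFS.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        # Assumption: the existing files are comma-delimited text.
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and HDFS path.
ddl = external_table_ddl(
    "web_logs",
    [("ts", "STRING"), ("url", "STRING")],
    "/data/web_logs",
)
print(ddl)
```

Running the printed statement in Hive should make the existing files queryable as a table. Dropping an external table removes only the metadata, while dropping a managed table also deletes its files from the warehouse directory.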

As for Presto: it has built-in support for the Hive metastore through its Hive connector, but you can also write your own connector plugin for any data source.
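As a rough illustration of what the Hive connector setup involves, the sketch below renders the contents of a Presto catalog file, typically saved as `etc/catalog/hive.properties`. The helper name `hive_catalog_properties` and the host `metastore.example.net` are placeholders. The connector name `hive-hadoop2` and port 9083 are the values I would expect for a Hadoop 2.x cluster with a default metastore, so check them against your own installation.

```python
def hive_catalog_properties(metastore_host, metastore_port=9083):
    """Render a minimal Presto catalog file for the Hive connector."""
    lines = [
        # Assumption: Hadoop 2.x cluster; other versions use a different name.
        "connector.name=hive-hadoop2",
        # Presto reads table metadata from the Hive metastore Thrift service.
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metastore host.
properties = hive_catalog_properties("metastore.example.net")
print(properties, end="")
```

With a catalog file like this in place, Presto looks up table definitions in the metastore and reads the underlying files from HDFS itself, without running the queries through Hive.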

You can read more about Hive connector configuration here and about connector plugins here.