
I am new to the Hadoop ecosystem and am teaching myself through online articles. I am working on a very basic project so that I can get hands-on experience with what I have learnt.

My use case is extremely simple: the idea is to show the app admin the location of each user who logs in to the portal. So, I have a server which is continuously generating logs; the logs contain a user ID, an IP address, and a timestamp. All fields are comma separated.

My idea is to have a Flume agent stream the live log data and write it to HDFS, have a Hive process in place which reads the incremental data from HDFS and writes it to a Hive table, and then use Sqoop to continuously copy data from Hive to an RDBMS SQL table and use that SQL table to play with. So far I have successfully configured a Flume agent which reads logs from a given location and writes them to an HDFS location. But after this I am confused about how I should move data from HDFS into a Hive table. One idea that comes to mind is to have a MapReduce program that reads the files in HDFS and writes to Hive tables programmatically in Java. But I also want to delete files which have already been processed and make sure that no duplicate records are read by MapReduce. I searched online and found a command that can be used to copy file data into Hive, but that is a sort of manual, one-time activity. In my use case I want to push data as soon as it is available in HDFS. Please guide me on how to achieve this task. Links would be helpful.
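For the Hive-to-RDBMS step, this is roughly the kind of Sqoop export I have in mind (the JDBC URL, table name, and export directory are just placeholders, not my actual setup):

    sqoop export \
      --connect jdbc:mysql://dbhost/portal \
      --username report_user -P \
      --table user_logins \
      --export-dir /user/hive/warehouse/user_logins \
      --input-fields-terminated-by ','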

I am working with Cloudera Express 5.13.0.


Update 1: I just created an external Hive table pointing to the HDFS location where Flume is dumping logs. I noticed that as soon as the table is created, I can query the Hive table and fetch data. This is awesome. But what will happen if I stop the Flume agent for the time being, let the app server write logs, and then start Flume again? Will Flume only read new logs and ignore the logs which have already been processed? Similarly, will Hive read the new logs which have not been processed and ignore the ones which it has already processed?

How do you deliver your data to HDFS? And why did you choose HDFS as a data buffer? – Lyashko Kirill
As I mentioned above, I am learning Hadoop technologies, so I am not focused on which tech is best for which use case; my only aim at present is to learn and see how things work. So I came up with the use case as described. I want to process the log data as mentioned in my question. Please guide me on how I can process the latest data from HDFS into Hive and delete old files from HDFS. Currently I am putting all log files within one folder; there is no partitioning, so as to keep it simple end to end. – Programmer
How do you write your data to HDFS? – Lyashko Kirill
Using Flume; Flume reads data from the server logs and writes it to HDFS. – Programmer
That's a very "outdated" setup. Kafka could probably be used in place of Flume. – OneCricketeer

1 Answer


how should I move data from HDFS to HIVE table

This isn't how Hive works. Hive is a metadata layer over existing HDFS storage. In Hive, you would define an EXTERNAL TABLE over wherever Flume writes your data.

As data arrives, Hive "automatically knows" that there is new data to be queried (since it reads all files under the given path).
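For example, something along these lines (the HDFS path, table name, and column types are placeholders based on your comma-separated fields, not your actual layout):

    CREATE EXTERNAL TABLE IF NOT EXISTS user_logins (
      user_id    STRING,
      ip_address STRING,
      login_ts   STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/flume/logins';  -- wherever your Flume HDFS sink writes

    -- each query re-scans whatever files currently exist under that path
    SELECT user_id, ip_address, login_ts
    FROM user_logins
    WHERE ip_address LIKE '10.%';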


what will happen if I stop flume agent for time being, let app server to write logs, now if I start flume again then will flume only read new logs and ignore logs which are already processed

Depends on how you've set up Flume. AFAIK, it will checkpoint all processed files and only pick up new ones.
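For example, with a TAILDIR source the position file records how far each log file has been read, so after a restart the agent resumes from the recorded offsets instead of re-shipping everything. A rough sketch (agent name, paths, and roll settings are placeholders, not your actual config):

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # TAILDIR keeps per-file byte offsets in a JSON position file
    a1.sources.r1.type = TAILDIR
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
    a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json
    a1.sources.r1.channels = c1

    # a file channel also survives agent restarts
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
    a1.channels.c1.dataDirs = /var/lib/flume/data

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /user/flume/logins
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.rollInterval = 300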

will hive read new logs which are not processed and ignore the ones which it has already processed?

Hive has no concept of unprocessed records. All files in the table location will always be read, limited by your query conditions, upon each new query.


Bonus: Remove Flume and Sqoop. Make your app produce records into Kafka. Have Kafka Connect (or NiFi) write to both HDFS and your RDBMS from a single location (a Kafka topic). If you actually need to read log files, Filebeat or Fluentd take fewer resources than Flume (or Logstash).

Bonus 2: Remove HDFS & RDBMS and instead use a more real-time ingestion pipeline like Druid or Elasticsearch for analytics.

Bonus 3: Presto / SparkSQL / Flink-SQL are faster than Hive (note: the Hive metastore is actually useful, so keep the RDBMS around for that)