I'm implementing a small Hadoop cluster for a POC in my company. I'm trying to import files into HDFS with Flume. Each file contains JSON objects formatted like this (one "long" line per file):
{ "objectType" : [ { JSON Object } , { JSON Object }, ... ] }
"objectType" is the type the objects in the array (ex: events, users, ...).
These files will be processed later by several tasks depending on the "objectType".
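To be concrete, an "events" file would look something like this (the fields inside the objects are invented for illustration):

    { "events" : [ { "eventId" : "e1", "timestamp" : 1397000000 } , { "eventId" : "e2", "timestamp" : 1397000060 } ] }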
I'm using the spoolDir source and the HDFS sink.
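For reference, here is roughly what my agent configuration looks like at the moment (the agent name, directories, namenode URL and channel type are placeholders / just what I'm currently experimenting with):

    agent.sources = spool-src
    agent.channels = ch
    agent.sinks = hdfs-sink

    # spoolDir source watching the drop directory (path is a placeholder)
    agent.sources.spool-src.type = spooldir
    agent.sources.spool-src.spoolDir = /data/flume/spool
    # raised from the default 2048 so a whole one-line JSON file fits in a single event
    agent.sources.spool-src.deserializer.maxLineLength = 1000000
    agent.sources.spool-src.channels = ch

    # this is the part I'm unsure about (file vs. JDBC)
    agent.channels.ch.type = file

    # HDFS sink writing the events as plain text (namenode URL is a placeholder)
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/incoming
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.channel = ch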
My questions are:
1. Is it possible to keep the source file name when Flume writes into HDFS? (File names are unique, since they contain a timestamp and a UUID.)
2. Is there a way to set "deserializer.maxLineLength" to an unlimited value, instead of just picking a very high number?
3. I really don't want to lose data. Which channel is better, JDBC or file? (I do not have a high-throughput flow.)
My constraint is that I have to use Flume out of the box (no custom components) as much as possible.
Thanks for your help!