I'm implementing a small Hadoop cluster for a POC at my company. I'm trying to import files into HDFS with Flume. Each file contains JSON objects like this (one "long" line per file):

{ "objectType" : [ { JSON Object } , { JSON Object }, ... ] }

"objectType" is the type the objects in the array (ex: events, users, ...).

These files will be processed later by several tasks depending on the "objectType".

I'm using the spoolDir source and the HDFS sink.
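
For reference, a minimal agent configuration for this setup looks roughly like the following (the agent, source, channel, and sink names, as well as the paths, are only placeholders):

agent-1.sources = src-1
agent-1.channels = ch-1
agent-1.sinks = snk-1

agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.spoolDir = /var/flume/spool
agent-1.sources.src-1.channels = ch-1

# the channel type is one of the open questions below; file is only a placeholder here
agent-1.channels.ch-1.type = file

agent-1.sinks.snk-1.type = hdfs
agent-1.sinks.snk-1.hdfs.path = hdfs://namenode/flume/incoming
agent-1.sinks.snk-1.channel = ch-1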

My questions are:

  • Is it possible to keep the source filename when Flume writes into HDFS (filenames are unique, as they contain a timestamp and a UUID in their names)?

  • Is there a way to set "deserializer.maxLineLength" to an unlimited value (instead of setting a high value)?

  • I really don't want to lose data. Which channel is best, JDBC or file? (I do not have a high-throughput flow.)

My constraint is that I have to use Flume out of the box (no custom elements) as much as possible.

Thanks for your help!


1 Answer


Is it possible to keep the source filename when Flume writes into HDFS (filenames are unique, as they contain a timestamp and a UUID in their names)?

Yes. With the spoolDir source, ensure the fileHeader attribute is set to true. This will include the filename with the record.

agent-1.sources.src-1.fileHeader = true
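
The header key under which the file path is stored can also be set explicitly; by default it is "file":

agent-1.sources.src-1.fileHeaderKey = file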

Then, for your sink, use the avro_event serializer to capture the filename in the header of your Avro Flume event record.

agent-1.sinks.snk-1.serializer = avro_event

The Avro record conforms to this schema: https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/FlumeEventAvroEventSerializer.java#L30
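
In context, the sink section could look roughly like this (the path is a placeholder); note that hdfs.fileType should be DataStream (or CompressedStream) so that the configured serializer controls the output format:

agent-1.sinks.snk-1.type = hdfs
agent-1.sinks.snk-1.channel = ch-1
agent-1.sinks.snk-1.hdfs.path = hdfs://namenode/flume/incoming
# DataStream lets the avro_event serializer write each event, headers included, as Avro
agent-1.sinks.snk-1.hdfs.fileType = DataStream
agent-1.sinks.snk-1.serializer = avro_event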

Is there a way to set "deserializer.maxLineLength" to an unlimited value (instead of setting a high value)?

There is no unlimited setting for deserializer.maxLineLength: https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/serialization/LineDeserializer.java#L143
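
So the best you can do is set the cap to a value comfortably larger than your longest line, for example (the number below is arbitrary):

# value is in characters; pick something larger than your longest line
agent-1.sources.src-1.deserializer.maxLineLength = 100000000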

I really don't want to lose data. Which channel is best, JDBC or file? (I do not have a high-throughput flow.)

This will probably depend on the resiliency options you have for your database or filesystem. If you have a redundant database with backups, go with the JDBC channel. If you have a durable, resilient filesystem, go with the file channel.
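
If you go with the file channel, a minimal sketch looks roughly like this (the directories are placeholders; putting them on reliable local disks is what actually gives you the durability):

agent-1.channels.ch-1.type = file
# checkpoint and data directories should live on durable storage
agent-1.channels.ch-1.checkpointDir = /var/flume/checkpoint
agent-1.channels.ch-1.dataDirs = /var/flume/data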