I have users writing Avro files, and I want to use Flume to move all those files into HDFS so that I can later query/analyse the data with Hive or Pig.
On the client I installed Flume with a spooldir source and an Avro sink, like this:
a1.sources = src1
a1.sinks = sink1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.src1.type = spooldir
a1.sources.src1.channels = c1
a1.sources.src1.spoolDir = {directory}
a1.sources.src1.fileHeader = true
a1.sources.src1.deserializer = avro
a1.sinks.sink1.type = avro
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hostname = {IP}
a1.sinks.sink1.port = 41414
On the Hadoop cluster I have this Avro source and HDFS sink:
a1.sources = avro1
a1.sinks = sink1
a1.channels = c1
a1.channels.c1.type = memory
a1.sources.avro1.type = avro
a1.sources.avro1.channels = c1
a1.sources.avro1.bind = 0.0.0.0
a1.sources.avro1.port = 41414
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = c1
a1.sinks.sink1.hdfs.path = {hdfs dir}
a1.sinks.sink1.hdfs.fileSuffix = .avro
a1.sinks.sink1.hdfs.rollSize = 67108864
a1.sinks.sink1.hdfs.fileType = DataStream
The problem is that the files on HDFS are not valid Avro files! I am using the Hue UI to check whether a file is a valid Avro file or not. If I upload an Avro file that I generated on my PC to the cluster, I can see its contents fine. But the files written by Flume are not valid Avro files.
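For reference, this is the kind of check I am doing outside of Hue as well; a minimal sketch, assuming the Avro Java library is on the classpath and the file has been copied out of HDFS to a hypothetical local path /tmp/test.avro:

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class CheckAvroFile {
    public static void main(String[] args) throws Exception {
        // A real Avro container file starts with the magic bytes "Obj\1" and
        // embeds the writer schema; DataFileReader rejects anything else.
        File f = new File("/tmp/test.avro"); // placeholder path
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(f, new GenericDatumReader<GenericRecord>())) {
            System.out.println("schema: " + reader.getSchema());
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
        }
    }
}

Files I generate locally pass this kind of check; the files Flume writes do not.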
I tried the Avro client that ships with Flume, but that didn't work because it sends one Flume event per line, which breaks the Avro files; that is fixed by the spooldir source with deserializer = avro. So I think the problem is in the HDFS sink when it writes the files.
With hdfs.fileType = DataStream it writes the values of the Avro fields, not the whole Avro file, losing all the schema information. If I use hdfs.fileType = SequenceFile the files are not valid either, for some reason.
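This is what I mean by losing the schema: each record in the output can only be read back if the reader already has the writer schema from somewhere else. A minimal sketch of that, where the schema file and the extracted record bytes are hypothetical placeholders, not something Flume gives you:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class DecodeBareDatum {
    public static void main(String[] args) throws Exception {
        // The writer schema has to be supplied out of band, because the bytes
        // written by the HDFS sink no longer carry it.
        Schema schema = new Schema.Parser().parse(new File("/tmp/schema.avsc")); // placeholder
        byte[] datumBytes = Files.readAllBytes(Paths.get("/tmp/one-record.bin")); // placeholder
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(datumBytes, null);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        GenericRecord record = reader.read(null, decoder);
        System.out.println(record);
    }
}

That works only because I hand it the schema myself; I want self-contained Avro files on HDFS that tools like Hive can read directly.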
Any ideas?
Thanks