I have a high volume service. I log events. Every few minutes, I zip the logs using gzip and rotate them to S3. From there, we process the logs using Amazon's Hadoop -- elastic mapreduce -- via Hive.
Right now on the servers, we get a CPU spike every few minutes when we zip and rotate the logs. We want to switch from gzip to either LZO or Snappy to reduce this CPU spike. We are a CPU-bound service, so we're willing to trade bigger log files for less CPU consumed when we rotate.
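To put numbers on that trade-off before switching, a minimal measurement sketch might look like the following. It uses only Python's standard `zlib` module at different compression levels as a stand-in, since LZO and Snappy bindings are not in the standard library; the `measure` helper and the synthetic log lines are hypothetical, and real logs will compress differently.

```python
import time
import zlib
from typing import Tuple


def measure(data: bytes, level: int) -> Tuple[float, int]:
    """Return (CPU seconds, compressed size in bytes) for one zlib level."""
    start = time.process_time()          # CPU time of this process, not wall clock
    compressed = zlib.compress(data, level)
    cpu = time.process_time() - start
    return cpu, len(compressed)


# Synthetic, repetitive log-like data (an assumption; substitute a real log file).
sample = b"".join(
    b"2012-01-01T00:00:%02d INFO request id=%d status=200\n" % (i % 60, i)
    for i in range(50_000)
)

for level in (1, 6, 9):
    cpu, size = measure(sample, level)
    print(f"level={level} cpu={cpu:.3f}s size={size} bytes")
```

The same harness could wrap an LZO or Snappy call once one of those libraries is installed, which would give a like-for-like comparison on our own log data.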
I've been doing a lot of reading on LZO and Snappy (aka Zippy). One of the advantages of LZO is that it is splittable in HDFS (as I understand it, once the files have been indexed). However, our files are ~15MB when gzipped, so I don't think they will reach the 64MB default block size in HDFS even with a weaker compressor, and splittability shouldn't matter. Even if they did, we should be able to raise the default block size to 128MB.
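A rough check of that block-size reasoning, as a sketch: the growth factors below are hypothetical guesses at how much larger a file might be under a faster compressor than under gzip, not measured values, and `blocks_needed` is a helper introduced only for illustration.

```python
import math


def blocks_needed(file_mb: float, block_mb: float = 64) -> int:
    """Number of HDFS blocks a file of file_mb megabytes would occupy."""
    return max(1, math.ceil(file_mb / block_mb))


GZIP_MB = 15  # current gzipped size from the paragraph above

# Hypothetical growth factors relative to gzip (assumptions, not benchmarks).
for growth in (1.5, 2.0, 3.0):
    size_mb = GZIP_MB * growth
    print(f"growth={growth}x size={size_mb}MB blocks={blocks_needed(size_mb)}")
```

As long as the result stays at one block, a non-splittable format costs nothing in parallelism for a single file.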
Right now, I want to try Snappy, as it seems to be slightly faster / less resource intensive. Neither seems to be in Amazon's yum repo, so we probably have to custom install / build anyway -- so not much of a tradeoff in terms of engineering time. I've heard some concerns about the LZO license, but I think I'm fine just installing it on our server if it doesn't go near our code, right?
So, which should I choose? Will one perform better in Hadoop than the other? Has anyone done this with either implementation, and did you run into any issues you could share?