4 votes

I have a high-volume service that logs events. Every few minutes, we gzip the logs and rotate them out to S3. From there, we process the logs with Amazon's hosted Hadoop (Elastic MapReduce) via Hive.

Right now on the servers, we get a CPU spike every few minutes when we zip and rotate the logs. We want to switch from gzip to either LZO or Snappy to reduce this CPU spike. We are a CPU-bound service, so we're willing to trade bigger log files for less CPU consumed when we rotate.
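As a rough way to quantify that trade-off, here's a minimal sketch that times each codec on one representative log file. It assumes the lzop and python-snappy tools are installed, and sample.log is a placeholder name:

$ time gzip -c sample.log > sample.log.gz          # current approach
$ time lzop sample.log                             # writes sample.log.lzo, keeps the original
$ time python -m snappy -c sample.log sample.log.snappy
$ ls -lh sample.log.gz sample.log.lzo sample.log.snappy   # compare output sizes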

I've been doing a lot of reading on LZO and Snappy (a.k.a. Zippy). One of the advantages of LZO is that it can be made splittable in HDFS (with an index). However, our files are ~15MB gzipped, so I don't think we'll get up to the 64MB default block size in HDFS, so this shouldn't matter. Even if it did, we should be able to turn the default up to 128MB, as sketched below.
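For reference, a hedged sketch of raising the block size for a file as it's written into HDFS (the property name is from Hadoop 1.x; the file and path are placeholders):

$ # 134217728 bytes = 128MB; dfs.block.size is the Hadoop 1.x property name
$ hadoop fs -D dfs.block.size=134217728 -put app.log.snappy /logs/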

Right now, I want to try Snappy, as it seems to be slightly faster / less resource-intensive. Neither seems to be in Amazon's yum repo, so we'd probably have to custom install/build either way -- so not much of a tradeoff in terms of engineering time. I've heard some concerns about LZO's GPL license, but I think I'm fine just installing it on our servers if it doesn't go near our code, right?
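For what it's worth, the from-source install is fairly standard. A sketch, assuming a snappy release tarball from that era (the 1.0.x series shipped with autotools) plus the python-snappy bindings from PyPI; the directory name is a placeholder:

$ cd snappy-1.0.x/                 # unpacked release tarball (placeholder version)
$ ./configure && make && sudo make install
$ sudo pip install python-snappy   # bindings that provide the `python -m snappy` CLI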

So, which should I choose? Will one perform better in Hadoop than the other? Has anyone done this with either implementation and run into any issues they could share?

Take a look at the Cloudera blog post. They go into detail about each one and recommend Snappy: blog.cloudera.com/blog/2011/09/snappy-and-hadoop. You can also find benchmarks of different compression types here: github.com/ning/jvm-compressor-benchmark/wiki - Dimitry
Thanks. We ended up going with LZO. We compared only compression times, which were roughly equivalent. We also had a hard time finding a solid Snappy command-line tool, which is pretty critical when you need to manually inspect the data occasionally. - John Hinnegan

1 Answer

2 votes

Maybe it's too late, but python-snappy provides a command-line tool for Snappy compression/decompression:

Compressing and decompressing a file:

$ python -m snappy -c uncompressed_file compressed_file.snappy

$ python -m snappy -d compressed_file.snappy uncompressed_file

Compressing and decompressing a stream:

$ cat uncompressed_data | python -m snappy -c > compressed_data.snappy

$ cat compressed_data.snappy | python -m snappy -d > uncompressed_data
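Since the follow-up comment above mentions needing to manually inspect data from the command line, here's a quick round-trip sanity check with the same tool (file names are placeholders):

$ python -m snappy -c app.log app.log.snappy        # compress
$ python -m snappy -d app.log.snappy app.log.out    # decompress
$ cmp app.log app.log.out && echo "round-trip OK"   # byte-for-byte comparison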

Snappy also consistently decompresses 20%+ faster than LZO, which is a pretty big win if you'll be reading the files repeatedly over Hadoop. Finally, it's used by Google for systems like BigTable and MapReduce, which is a pretty strong endorsement, for me at least.