We currently have some files stored on Amazon S3. They are log files (.log extension, but plain-text content) that have been gzipped to save disk space. However, gzip isn't splittable, so we are looking for a few good alternatives for storing and processing our files on Amazon EMR.
So what is the best compression codec or file format to use for log files? I have come across Avro, SequenceFile, bzip2, LZO and Snappy. That is a lot of options, and I am a bit overwhelmed.
I would appreciate any insights on this matter.
The data will be used in Pig jobs (MapReduce jobs).
Kind regards