I have a number of LZO-compressed log files on Amazon S3, which I want to read from PHP. The AWS SDK provides a nice StreamWrapper for reading these files efficiently, but since the files are compressed, I need to decompress the content before I can process it.
I have installed the PHP-LZO extension which allows me to do lzo_decompress($data)
, but since I'm dealing with a stream rather than the full file contents, I assume I'll need to consume the string one LZO compressed block at a time. In other words, I want to do something like:
$s3 = S3Client::factory( $myAwsCredentials );
$s3->registerStreamWrapper();
$stream = fopen("s3://my_bucket/my_logfile", 'r');
$compressed_data = '';
while (!feof($stream)) {
$compressed_data .= fread($stream, 1024);
// TODO: determine if we have a full LZO block yet
if (contains_full_lzo_block($compressed_data)) {
// TODO: extract the LZO block
$lzo_block = get_lzo_block($compressed_data);
$input = lzo_decompress( $lzo_block );
// ...... and do stuff to the decompressed input
}
}
fclose($stream);
The two TODO
s are where I'm unsure what to do:
- Inspecting the data stream to dtermine whether I have a full LZO block yet
- Extracting this block for decompression
Since the compression was done by Amazon (s3distCp) I don't have control over the block size, so I'll probably need to inspect the incoming stream to determine how big the blocks are -- is this a correct assumption?
(ideally, I'd use a custom StreamFilter directly on the stream, but I haven't been able to find anyone who has done that before)