I am pushing logs to an S3 bucket via Firehose.
The data has a very simple format:
{
  "email": "some email",
  "message": "a log message",
  "data": "{ /* ...some json */ }"
}
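For reference, I send records roughly like this (a simplified sketch; the stream name and field values are placeholders):

import json
import boto3

firehose = boto3.client("firehose")

def send_log(email, message, data):
    # Serialize the record; note that no newline delimiter is appended.
    record = {"email": email, "message": message, "data": json.dumps(data)}
    firehose.put_record(
        DeliveryStreamName="user-logs",  # placeholder stream name
        Record={"Data": json.dumps(record).encode("utf-8")},
    )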
I created this table definition for Athena:
CREATE EXTERNAL TABLE `logs`(
  `email` string COMMENT 'from deserializer',
  `message` string COMMENT 'from deserializer',
  `data` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://USERLOGS/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1583271303')
It works well when the S3 file contains a single JSON blob, but Firehose batches multiple records into each file it writes to S3, and only the first record in each batch shows up in query results.
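For illustration, the contents of a delivered S3 object look something like this (values made up): the records are concatenated back to back on a single line, with no newline between them:

{"email":"a@example.com","message":"first entry","data":"{}"}{"email":"b@example.com","message":"second entry","data":"{}"}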
How do I make Athena query every record in each batch? I have 100 records but can only see 6 of them because of this.