I'm trying to build a centralized logging solution using CloudWatch Logs subscription filters to write logs to Kinesis Firehose -> S3 -> AWS Glue -> Athena. I'm running into a lot of issues with data formatting.
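For context, logs enter the pipeline through an AWS::Logs::SubscriptionFilter, which exposes no data-format options at all. A minimal sketch of that wiring (AppLogGroup, DeliveryStream, and CloudWatchToFirehoseRole are placeholder names standing in for my actual resources):

LogsSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  Properties:
    LogGroupName: !Ref AppLogGroup                # placeholder source log group
    FilterPattern: ""                             # empty pattern forwards every event
    DestinationArn: !GetAtt DeliveryStream.Arn    # the Firehose delivery stream
    RoleArn: !GetAtt CloudWatchToFirehoseRole.Arn # role CloudWatch Logs assumes to put records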
Initially, I was using the AWS::KinesisFirehose::DeliveryStream resource's S3DestinationConfiguration to write to S3, and then trying to either crawl the data with an AWS::Glue::Crawler or create the table manually in the CloudFormation template (a sketch of that table follows below). I found the crawler had a lot of trouble determining the data format on S3 (it classified the data as ION instead of JSON, and ION can't be queried by Athena). I'm now trying ExtendedS3DestinationConfiguration, which allows explicit configuration of the input and output formats to force the output to Parquet.
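Here is a sketch of the manually created table (the LogsGlueTable referenced further down), with a placeholder column list and Parquet storage settings chosen to match the conversion output:

LogsGlueTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref CentralizedLoggingDatabase
    TableInput:
      Name: centralized_logs
      TableType: EXTERNAL_TABLE
      PartitionKeys:            # matches the year/month/day/hour S3 prefix
        - Name: year
          Type: string
        - Name: month
          Type: string
        - Name: day
          Type: string
        - Name: hour
          Type: string
      StorageDescriptor:
        Location: !Sub s3://${S3Bucket}/${S3LogsPath}
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
        Columns:                # placeholder columns; the real schema mirrors the log events
          - Name: timestamp
            Type: bigint
          - Name: message
            Type: string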
Unfortunately, with this setup Kinesis Firehose returns error logs saying the input is not valid JSON. This makes me wonder if the CloudWatch subscription filter is not writing proper JSON - but there are no configuration options on that resource to control the data format.
This is not a particularly unusual problem, so somebody out there must have a working configuration. Here are some snippets of my failing configuration:
ExtendedS3DestinationConfiguration:
  BucketARN: !Sub arn:aws:s3:::${S3Bucket}
  Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
  ErrorOutputPrefix: !Sub ${FailedWritePath}
  BufferingHints:
    IntervalInSeconds: 300
    SizeInMBs: 128
  CloudWatchLoggingOptions:
    Enabled: true
    LogGroupName: !Sub ${AppId}-logstream-${Environment}
    LogStreamName: logs
  CompressionFormat: UNCOMPRESSED
  RoleARN: !GetAtt FirehoseRole.Arn
  DataFormatConversionConfiguration:
    Enabled: true
    InputFormatConfiguration:
      Deserializer:
        OpenXJsonSerDe: {}
    OutputFormatConfiguration:
      Serializer:
        ParquetSerDe: {}
    SchemaConfiguration:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CentralizedLoggingDatabase
      Region: !Ref AWS::Region
      RoleARN: !GetAtt FirehoseRole.Arn
      TableName: !Ref LogsGlueTable
      VersionId: LATEST
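One non-obvious dependency here: per the Firehose docs, record format conversion requires the delivery stream's role to read the table definition from Glue (the glue:GetTable, glue:GetTableVersion, and glue:GetTableVersions actions). A trimmed sketch of that part of FirehoseRole, with the S3 and CloudWatch Logs statements omitted:

FirehoseRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: firehose.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: glue-schema-lookup  # trimmed: S3/CloudWatch statements omitted
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - glue:GetTable
                - glue:GetTableVersion
                - glue:GetTableVersions
              Resource: "*"             # scoped down to the database/table in practice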
Former config:
S3DestinationConfiguration:
  BucketARN: !Sub arn:aws:s3:::${S3Bucket}
  Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
  ErrorOutputPrefix: !Sub ${FailedWritePath}
  BufferingHints:
    IntervalInSeconds: 300
    SizeInMBs: 128
  CloudWatchLoggingOptions:
    Enabled: true
    LogGroupName: !Sub ${AppId}-logstream-${Environment}
    LogStreamName: logs
  CompressionFormat: GZIP
  RoleARN: !GetAtt FirehoseRole.Arn
And the crawler:
Type: AWS::Glue::Crawler
Properties:
  Name: !Sub ${DNSEndPoint}_logging_s3_crawler_${Environment}
  DatabaseName: !Ref CentralizedLoggingDatabase
  Description: AWS Glue crawler to crawl logs on S3
  Role: !GetAtt CentralizedLoggingGlueRole.Arn
  # Schedule: ## run on demand
  #   ScheduleExpression: cron(40 * * * ? *)
  Targets:
    S3Targets:
      - Path: !Sub s3://${S3Bucket}/${S3LogsPath}
  SchemaChangePolicy:
    UpdateBehavior: UPDATE_IN_DATABASE
    DeleteBehavior: LOG
  TablePrefix: !Sub ${AppId}_${Environment}_
The error, using ExtendedS3DestinationConfiguration:
"attemptsMade":1,"arrivalTimestamp":1582650068665,"lastErrorCode":"DataFormatConversion.ParseError","lastErrorMessage":"Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream@2ce955fc; line: 1, column: 2]
It seems like there is a configuration issue here, but I cannot find it.