My pipeline is as follows:

Firehose -> Lambda (AWS' Java SDK) -> (S3 & Redshift)

An un-encoded (raw) JSON record is submitted to Firehose. It then triggers a Lambda function which transforms it slightly. Firehose then puts the transformed record into an S3 bucket and into Redshift.

For Firehose to add the transformed data to S3, it requires that the data be Base64 encoded (and Firehose decodes it before adding it to S3).

However, the data contains a URL, and in the decoded output its = characters are replaced with the equivalent Unicode escape sequence (\u003d), and & with \u0026. This seems to happen because = is the character that Amazon's Base64 decoder uses as padding.

https://www.[snipped].com/...?returnurl\u003dnull\u0026referrer\u003dnull

How can I retain those = characters within the decoded data?

Note: I've tried using Base64.getUrlEncoder(), but AWS only seems to support Base64.getEncoder().

And you're sure this isn't just an artifact of how you're displaying the data? – Michael - sqlbot
@Michael-sqlbot I’m sure; I’ve downloaded the S3 file and opened it using many different text editors (all of which were set to UTF-8), but they all replace = with \u003d. Interestingly, the S3 record is also added to Redshift via Firehose, and Redshift does show the = character. – Jacob G.
You've said Firehose -> S3 but also Firehose -> Lambda -> S3. It isn't clear to me what role the Lambda code might be playing, but it seems like that is the most likely suspect. \u003d isn't equivalent to = in a UTF-8 text file, but it is in JSON and of course the interface to Lambda is always JSON (though irrelevant if the data in and out is always represented in base64). I don't actually understand your setup well enough to know if this is a useful piece of speculation on my part. – Michael - sqlbot
@Michael-sqlbot I apologize for not explaining it better. An un-encoded (raw) JSON record is submitted to Firehose. It then triggers a Lambda function which transforms it slightly. Firehose then puts the transformed record into an S3 bucket and into Redshift. Hopefully this helps clear things up slightly! – Jacob G.

1 Answer

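It turns out that HTML escaping was enabled on the JSON library (Gson) that I was using to (de)serialize my Lambda record. Gson enables it by default, so characters such as = and & are written as \u003d and \u0026. Base64 padding had nothing to do with it. To fix it, I just had to disable HTML escaping when building the Gson instance: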

Gson gson = new GsonBuilder().disableHtmlEscaping().create();
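
As a sanity check that Base64 itself leaves = alone, here is a minimal sketch using only java.util.Base64 from the JDK. The class name Base64RoundTrip, the roundTrip helper, and the example.com URL are made up for illustration (the real host is snipped above); this is not the actual Lambda code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64RoundTrip {
    // Encodes the text with the standard Base64 encoder, then decodes it again.
    static String roundTrip(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(raw);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical record; example.com stands in for the snipped host.
        String json = "{\"url\":\"https://www.example.com/?returnurl=null&referrer=null\"}";
        // Prints true: the = and & characters survive the round trip unchanged.
        System.out.println(roundTrip(json).equals(json));
    }
}
```

If the round trip preserves the string, any \u003d in the output has to come from the JSON serialization step rather than from the Base64 encoding or decoding.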