0 votes

I have a Lambda function in Python that does some processing and returns some values that need to be inserted into Redshift. In the Lambda, I push the values to Kinesis, which copies them to S3 and from there into Redshift.

The values in the Lambda are collected into a single string as follows:

final_string = 'a;b;c;d'

Each letter is the value for a different column in the Redshift table, hence the delimiter is ';'. Then I push the data to the Kinesis stream with:

put_response = kinesis_client.put_record(StreamName='PixelTrack',
                                         Data=json.dumps(final_string),
                                         PartitionKey='first')
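
For reference, json.dumps applied to a plain string wraps it in double quotes and adds no newline, which appears to be where the quotes in the S3 file below come from. A quick check:

import json

final_string = 'a;b;c;d'
print(json.dumps(final_string))  # prints "a;b;c;d" (quotes included, no trailing newline)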

Then, the Kinesis stream feeds a Kinesis Firehose delivery stream. The file that Kinesis Firehose generates in S3 looks like this (the quotes are in the file):

"a;b;c;d;c" 

And finally I copy the data into Redshift with the following statement (configured in Kinesis Firehose):

COPY table
FROM blabla
BLANKSASNULL
DELIMITER ';'
EMPTYASNULL
NULL AS 'null'
ESCAPE
FILLRECORD;
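
As far as I understand, for this COPY to load multiple records, the file in S3 needs one record per line, something like:

a;b;c;d
a1;b1;c1;d1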

I have managed to make it work and get the values into Redshift when just one record is buffered in Kinesis (although I had to create an extra column in Redshift for it). So, when just one Lambda execution happens during the buffer time, the Redshift table looks like:

  A        B         C         D     no_info_column
  "a       b         c         d"        <null>

The problem comes when the Lambda executes several times during the buffer window, because then I get a file in S3 with the following text in it:

"a,b,c,d" "a1,b1,c1,d1"

And in Redshift I get the error Extra column(s) found, because the COPY statement cannot tell where one row ends and the next begins.

I have tried the following without success:

  • Returning the plain string from the Lambda
  • Searching for a way to set a row delimiter in COPY (SO question)
  • Converting a list to JSON instead of a string; then I had problems with the brackets that open the list (see the snippet after this list)
  • Using REMOVEQUOTES in the COPY statement
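
For example, the bracket problem from the JSON-list attempt (a quick check in Python):

import json

print(json.dumps(['a', 'b', 'c', 'd']))  # prints ["a", "b", "c", "d"]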

My original question was going to be "How to copy from S3 to Redshift with different rows separated by double quotes", but maybe the problem lies in my overall approach, so I have decided to make the question a bit broader.

So, how could I solve this?

Why not cut out Kinesis and just insert into Redshift directly from your Lambda? You can easily connect to Redshift programmatically in Python, and this would save you lots of time and money. – ketcham
Are you using Kinesis Streams or Kinesis Firehose? If you are sending to Redshift, then the easier method is to use Kinesis Firehose, because it takes care of the whole process for you, automatically buffering and sending content to Redshift. – John Rotenstein
I have created a Kinesis Stream, which sends the info to Kinesis Firehose (maybe that is the problem?). I am not inserting directly from the Lambda because I asked and was told that INSERT statements are very inefficient, and we are talking about 4 million rows per day that need to be inserted, which might eventually slow down the database a lot. – Javier Lopez Tomas
@John Rotenstein you gave me the idea of using just Kinesis Firehose instead of both, so you can answer the question if you want. – Javier Lopez Tomas

1 Answer

1 vote

If you wish to send streaming data to Amazon Redshift, you can use Amazon Kinesis Data Firehose. It has built-in buffering of data by size (MB) or time (seconds) for batching writes to Amazon Redshift.

You are correct that it is not ideal to perform small INSERT operations on Redshift; it is much better to load data in bulk. So, if you need to load data on a continuous basis, Kinesis Data Firehose gives a good mix of performance and simplicity.

You mention that "the kinesis stream feeds a Kinesis Firehose Stream". There is no need for a Kinesis Data Stream in the middle: your AWS Lambda function can write directly to Kinesis Data Firehose.
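
A minimal sketch of that direct write, assuming boto3 and reusing the 'PixelTrack' name from your question as the Firehose delivery stream name (adjust to your actual setup). It sends the raw delimited string rather than json.dumps(final_string), which should avoid the surrounding quotes, and it appends a newline so each record lands on its own line in S3:

import boto3

firehose_client = boto3.client('firehose')

final_string = 'a;b;c;d'  # the delimited record built in the Lambda

# Send the raw string plus a newline: each buffered record then sits
# on its own line in the S3 object, which is what COPY ... DELIMITER ';'
# expects. json.dumps is deliberately avoided here because it wraps
# the string in double quotes.
firehose_client.put_record(
    DeliveryStreamName='PixelTrack',  # assumed delivery stream name
    Record={'Data': (final_string + '\n').encode('utf-8')}
)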