3
votes

I am trying to copy an avro file to redshift using the COPY command. The file has a column that is of the type:

{'name': 'timestamp', 'type': ['null', {'logicalType': 'timestamp-millis', 'type': 'long'}]}],

Redshift variable type: "timestamp" timestamptz

When I run the following command copy if fails:

COPY table_name 
from 'fil_path.avro' 
iam_role 'the_role' 
FORMAT AS avro 'auto' 

raw field value: 1581306474335

Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF]

However If I add the following line It works:

timeformat 'epochmillisecs'

I tried to put my timestamp in microseconds which should be the base supported epoch resolution but it fails as well, and didn't find an appropriate name (epochmicrosecs didn't seem to do the job). My question is why is it so?

Furthermore I have another field that is causing some problem. A date field which apparently is saved as a number of days in the avro file (7305) that gives the following error:

Redshift variable type: "birthdate" date

avro: 'date_of_birth', 'type': ['null', {'type': 'int', 'logicalType': 'date'}]}

Invalid Date Format - length must be 10 or more

1
I have no clue as of why my solution works. I had the same problem (date as number of days after epoch). I passed the date as text '2020-02-01' and it worked. - Gabber

1 Answers

0
votes

Firstly, about the Time Format:

As Docs states:

COPY command attempts to implicitly convert the strings in the source data to the data type of the target column. If you need to specify a conversion that is different from the default behavior, or if the default conversion results in errors, you can manage data conversions by specifying the following parameters.

First Solution:

Redshift Doesn't Recognize epoch time by default to be able to convert it to the format of TimeStamp as a result it can't extract year, month, day..etc from the epoch time to put them in the TimeStamp Format, as stated by the Docs:

If your source data is represented as epoch time, that is the number of seconds or milliseconds since January 1, 1970, 00:00:00 UTC, specify 'epochsecs' or 'epochmillisecs'.

This is the supported Formats that Redshift can convert Using automatic recognition.

  • TimeStamp needs the format to be as YYYYMMDD HHMISS = 19960108 040809 to be able to extract it right, that's what the error state Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF], while epoch time format is just seconds or milliseconds since January 1, 1970 that it doesn't understand how to extract it's values from.
  • microseconds isn't supported as a parameter for TIMEFORMAT in Redshift.

Second Solution:

  1. You won't need to pass TIMEFORMAT to the COPY command, but you will insert epoch time in your staging tables as VARCHAR or TEXT.
  2. Then, when inserting epoch time from your staging tables into the schema tables convert it like this: TIMESTAMP 'epoch' + epoch_time/1000 * interval '1 second' AS time

Secondly, about date field:

  • DATE data type is specified as Calendar date (year, month, day) as stated by the Docs, As a result it can't be the number of days or be less than 10 characters in length (as 2021-03-04) and that's what the error tell us Invalid Date Format - length must be 10 or more.

The solution for Date field:

  1. You need to do a work-around, by passing the number of days as a VARCHAR or text to your staging tables.
  2. When loading the schema tables from the staging tables, apply Data cleaning by convert number of days to a DATE using TOCHAR: TO_DATE(TO_CHAR(number of days, '9999-99-99'),'YYYY-MM-DD')
  • As a result, number of days will be a valid DATE in your schema tables.