1
votes

I need to store timezone info with my timestamp column in an Redshift external table. I am using the below commands:

Create external table:

CREATE EXTERNAL TABLE schema.test
(
    user_id BIGINT,
    created_by BIGINT,
    created_date TIMESTAMP
)
PARTITIONED BY (application_id varchar(100), aggreate_type varchar(200), created DATE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS 
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 's3://<>/' 
table properties ('compression_type'='snappy', 'write.parallel'='off', 'write.maxfilesize.mb'='128');

UNLOAD data to S3

UNLOAD ('SELECT 
    user_id
    ,created_by
    ,(cast(created_date as date) || \'T\' || cast(created_date as time) || \'.000Z\') AS created_date
FROM intake.ods_user_resource_role
WHERE res_role_id = \'185814211\'')
TO 's3://<>/'
PARTITION BY (application_id, aggreate_type, created)
access_key_id '<>' secret_access_key '<>'
ENCRYPTED FORMAT AS PARQUET  ALLOWOVERWRITE  PARALLEL OFF  
;

Update table metadata to create partitions:

ALTER TABLE schema.test 
ADD PARTITION (application_id='abc', aggreate_type='xyz', created='2018-07-11') 
LOCATION 's3://<>/application_id=abc/aggreate_type=xyz/created=2018-07-11';

S3 data sample:

{
  "user_id": 12345,
  "created_by": 6789,
  "created_date": "2018-07-11T17:39:53.000Z"
}

Case 1: When using 'created_date TIMESTAMP' in DDL, I am getting below error:

Error: SQL Error [500310] [XX000]: Amazon Invalid operation: Spectrum Scan Error. File 'https://s3.
<file_location>000.parquet has an incompatible Parquet schema for column 's3://<>-

It is happening due to incompatibility between created_date column. Data in S3 is stored as "2018-07-11T17:39:53.000Z" but column is defined as TIMESTAMP (i.e. TIMESTAMP Withiout Timezone)

Case 2: When using 'created_date TIMESTAMP With Time zone' in DDL, I am getting below error:

Error: SQL Error [500310] [42804]: Amazon Invalid operation: Unsupported data type "pg_catalog.timestamptz" for external table creation;

Please let me know if there is a way to store Timestamp with Timezone("2018-07-11T17:39:53.000Z") in Redshift external table.

1

1 Answers

0
votes

I may be wrong, I'd need to run a cluster to check, but I believe the timestamp format "2018-07-11T17:39:53.000Z" is not valid in the first place; Redshift will not accept it.

Next, you are aware that TIMESTAMPTZ does not store timezone information? it applies the timezone information to the timestamp, converting it to UTC, and stores that. The timezone information is thrown away.

As it is, TIMEZONETZ is not available for Spectrum.

You need to use two columns, one for timezone and one for timestmap, and populate the two columns yourself, manually extracting the timezone and timestamp from the incoming timestamp string.