
I have an existing table defined in Redshift with certain columns. Data from one S3 bucket is already being mapped into it successfully using a jsonpaths definition and a COPY FROM JSON command. This all works fine.

But I now have differently formatted data stored in a different S3 bucket, and I would like to load that data into the SAME Redshift table. However, this new data doesn't have values for all of the columns in the table - I would simply like those columns to default to NULL for the new bucket's data. My problem is that I don't know how best to set up the jsonpaths definition for the NEW S3 bucket so that I can also easily run COPY FROM JSON on it.

I am trying to avoid pre-processing the data in the new bucket, which would amount to just adding extra keys (mapping to the unused columns) with empty values.
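
For illustration, a record in the new bucket looks something like the first object below (the values and exact key names here are made up), and the pre-processing I want to avoid would just pad each record out with placeholder keys, something like the second object:

{"platform": "ios"}
{"platform": "ios", "app_id": null, "version_code": null}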

I have tried simply putting empty strings in the jsonpaths to line up with the columns in the Redshift table that I don't care about populating. However, the AWS Redshift documentation indicates that empty strings cause an error during the COPY FROM JSON command: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json-jsonpaths
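
In other words, the mapping I tried looked something like this, with empty strings standing in for the columns I don't need, which is exactly what the documentation says is not allowed:

{
    "jsonpaths": [
       "$['platform']",
       "",
       ""
    ]
}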

Existing Redshift table

CREATE TABLE events (
    platform_name VARCHAR(10),
    app_id VARCHAR(10),
    app_version VARCHAR(10)
)

Existing map file for existing bucket

{
    "jsonpaths": [
       "$['device']['platform']['name']",
       "$['application']['app_id']",
       "$['application']['version_code']"
    ]
}

Map file for the new bucket, where the data objects only have a value corresponding to name and NOT to app_id or app_version, but which I still want to load into the same Redshift table

{
    "jsonpaths": [
       "$['platform']",
       ???,
       ???
    ]
}

Is one option to just put in JSON paths that I know for certain won't exist in the new format? That would seem odd...


2 Answers

1 vote

The best way to approach this problem is to define default values in your table definition.

CREATE TABLE events (
    platform_name VARCHAR(10),
    app_id VARCHAR(10) DEFAULT NULL,
    app_version VARCHAR(10) DEFAULT NULL
)

Keep only the fields available in your dataset inside the map file

{
    "jsonpaths": [
       "$['platform']"
    ]
}

Then specify only that column when calling the COPY command

COPY events (platform_name)
FROM 's3 file location'
json 'map file location'
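
For the original bucket, the COPY can keep using the full column list (the file locations are placeholders, as above):

COPY events (platform_name, app_id, app_version)
FROM 's3 file location'
json 'map file location'

Any column left out of the COPY column list is filled with its default, which with the definition above is NULL.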

Hope this helps.

0 votes

Let's take a different approach.

Leave your events table as is and create a new table, events2, as

CREATE TABLE events2 (
    platform_name VARCHAR(10)
)

Load the events2 table with this map file

{
    "jsonpaths": [
       "$['platform']"
    ]
}
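
The COPY for the new bucket then only needs to reference that one column (file locations are placeholders):

COPY events2 (platform_name)
FROM 's3 file location for the new bucket'
json 'map file location for the new bucket'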

Create a view like the following to see all events data

CREATE VIEW all_events AS
  SELECT platform_name, app_id, app_version
    FROM events
  UNION ALL
  SELECT platform_name, NULL, NULL
    FROM events2
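
Queries then go against the view rather than the individual tables, for example to see the rows that came in without an app_id:

SELECT platform_name, app_id, app_version
FROM all_events
WHERE app_id IS NULL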

This is more of a workaround, but as long as it works, why not? Would that be a feasible solution for you?