
I have an existing table defined in Redshift with certain columns. Data from one S3 bucket is already being mapped into it successfully using a jsonpaths definition and a COPY FROM JSON command. This all works fine.

But I now have differently formatted data stored in a different S3 bucket, and I would like to load that data into the SAME Redshift table. However, this new data doesn't have values for all of the columns in the table - I would simply like those columns to default to NULL for the new bucket's data. My problem is that I don't know how best to set up the jsonpaths definition for the NEW S3 bucket so that I can also easily run COPY FROM JSON on it.

I am trying to avoid pre-processing the data in the new bucket, which would amount to just adding extra keys (mapping to the unused columns) with empty values.
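
For illustration, a record in the new bucket looks something like the first object below (the values and exact key names here are made up), and the pre-processing I want to avoid would just pad each record out with placeholder keys, something like the second object:

{"platform": "ios"}
{"platform": "ios", "app_id": null, "version_code": null}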

I have tried simply putting empty strings in the jsonpaths to line up with the columns in the Redshift table that I don't care about populating. However, the AWS Redshift documentation indicates that empty strings cause an error during the COPY FROM JSON command: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json-jsonpaths
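
In other words, the mapping I tried looked something like this, with empty strings standing in for the columns I don't need, which is exactly what the documentation says is not allowed:

{
    "jsonpaths": [
       "$['platform']",
       "",
       ""
    ]
}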

Existing Redshift table

CREATE TABLE events (
    platform_name VARCHAR(10),
    app_id VARCHAR(10),
    app_version VARCHAR(10)
)

Existing map file for existing bucket

{
    "jsonpaths": [
       "$['device']['platform']['name']",
       "$['application']['app_id']",
       "$['application']['version_code']"
    ]
}

Map file for the new bucket, where the data objects only have a value corresponding to name and NOT to app_id or app_version, but which I still want to load into the same Redshift table

{
    "jsonpaths": [
       "$['platform']",
       ???,
       ???
    ]
}

Is one option to just put in JSON paths that I know for certain won't exist in the new format? That would seem odd...


2 Answers

1 vote

The best way to approach this problem is to define default values in your table definition.

CREATE TABLE events (
    platform_name VARCHAR(10),
    app_id VARCHAR(10) DEFAULT NULL,
    app_version VARCHAR(10) DEFAULT NULL
)

Keep only the fields available in your dataset inside the map file

{
    "jsonpaths": [
       "$['platform']"
    ]
}

Then specify only that column when calling the COPY command

COPY events (platform_name)
FROM 's3 file location'
json 'map file location'
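
For the original bucket, the COPY can keep using the full column list (the file locations are placeholders, as above):

COPY events (platform_name, app_id, app_version)
FROM 's3 file location'
json 'map file location'

Any column left out of the COPY column list is filled with its default, which with the definition above is NULL.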

Hope this helps.

0 votes

Let's take a different approach.

Leave your events table as is and create a new table, events2, as

CREATE TABLE events2 (
    platform_name VARCHAR(10)
)

Load the events2 table with this map file

{
    "jsonpaths": [
       "$['platform']"
    ]
}
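
The COPY for the new bucket then only needs to reference that one column (file locations are placeholders):

COPY events2 (platform_name)
FROM 's3 file location for the new bucket'
json 'map file location for the new bucket'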

Create a view like the following to see all events data

CREATE VIEW all_events AS
  SELECT platform_name, app_id, app_version
    FROM events
  UNION ALL
  SELECT platform_name, NULL, NULL
    FROM events2
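
Queries then go against the view rather than the individual tables, for example to see the rows that came in without an app_id:

SELECT platform_name, app_id, app_version
FROM all_events
WHERE app_id IS NULL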

This is more of a workaround, but as long as it works, why not? Would that be a feasible solution for you?