Transform date format inside CSV using Apache NiFi

Question

I need to modify a CSV file in an Apache NiFi environment.

My CSV file looks like this:

Advertiser ID,Campaign Start Date,Campaign End Date,Campaign Name
10730729,1/29/2020 3:00:00 AM,2/20/2020 3:00:00 AM,Nestle
40376079,2/1/2020 3:00:00 AM,4/1/2020 3:00:00 AM,Heinz
...

I want to transform the dates with AM/PM values into a simple date format, for example from 1/29/2020 3:00:00 AM to 2020-01-29, for each row. I read about the UpdateRecord processor, but there is a problem: the CSV headers contain spaces, and I can't reference these fields with either Replacement Value Strategy (Literal or Record Path).

Any ideas on how to solve this? Should I somehow rename the headers, e.g. from Advertiser ID to advertiser_id?

Ryan Ryan · Accepted Answer · 2021-02-18T20:03:30

You don't need to do the transformation yourself; you can let your Readers and Writers handle it for you. For the CSV Reader to recognize the dates, though, you will need to define a schema for your rows. Your schema would look something like this (I've removed the spaces from the column names because they are not allowed in Avro field names):

{
    "type": "record",
    "name": "ExampleCSV",
    "namespace": "Stackoverflow",
    "fields": [
        {"name": "AdvertiserID", "type": "string"},
        {"name": "CampaignStartDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
        {"name": "CampaignEndDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
        {"name": "CampaignName", "type": "string"}
    ]
}

To configure the reader, set the following properties:

Schema Access Strategy = Use 'Schema Text' property
Schema Text = (Above codeblock)
Treat First Line as Header = True
Timestamp Format = "MM/dd/yyyy hh:mm:ss a"

Additionally, you can set the following property to ignore the CSV header if you don't want to, or are unable to, change the upstream system to remove the spaces:

Ignore CSV Header Column Names = True

Then in your CSVRecordSetWriter service you can specify the following:

Schema Access Strategy = Inherit Record Schema
Timestamp Format = "yyyy-MM-dd"
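
To make the effect of the two Timestamp Format values concrete, here is a minimal Python sketch of the same parse-then-format step. It is not NiFi code: the `strptime`/`strftime` directives are only approximate Python counterparts of the Java-style patterns above, and the names `READER_FORMAT`, `WRITER_FORMAT` and `reformat_timestamp` are made up for this illustration.

```python
from datetime import datetime

# Approximate Python counterparts of the patterns configured above
# (assumption: these mirror the Java-style patterns; they are not NiFi syntax).
READER_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # roughly MM/dd/yyyy hh:mm:ss a
WRITER_FORMAT = "%Y-%m-%d"              # roughly yyyy-MM-dd


def reformat_timestamp(value: str) -> str:
    """Parse a value with the reader pattern and render it with the writer pattern."""
    parsed = datetime.strptime(value, READER_FORMAT)
    return parsed.strftime(WRITER_FORMAT)


if __name__ == "__main__":
    print(reformat_timestamp("1/29/2020 3:00:00 AM"))  # 2020-01-29
```

The reader turns the text into a real timestamp value, and the writer decides how that value is rendered, which is why no explicit transformation rule is needed in between.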

You can use UpdateRecord or ConvertRecord (or any other processor that lets you specify both a reader and a writer), and it will do the conversion for you. The difference between the two is that UpdateRecord requires you to specify a user-defined property, so if this is the only change you will make, just use ConvertRecord. If you have other transformations, use UpdateRecord and make those changes at the same time.

Caveat: This will rewrite the file using the new column names (in my example, the ones without spaces), so keep that in mind for downstream usage.
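
The header effect described in the caveat can be sketched in plain Python as follows. This only illustrates the idea of discarding the original header and assigning the schema's field names by position; it is not how NiFi implements it, it leaves the data rows untouched, and `SCHEMA_FIELDS` and `rewrite_header` are hypothetical names.

```python
import csv
import io

# Field names taken from the example schema, in column order.
SCHEMA_FIELDS = ["AdvertiserID", "CampaignStartDate", "CampaignEndDate", "CampaignName"]


def rewrite_header(csv_text: str) -> str:
    """Drop the original header row and emit the schema's field names instead."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # discard the original header (the one containing spaces)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(SCHEMA_FIELDS)  # names are assigned purely by position
    writer.writerows(reader)        # data rows pass through unchanged here
    return out.getvalue()


sample = (
    "Advertiser ID,Campaign Start Date,Campaign End Date,Campaign Name\n"
    "10730729,1/29/2020 3:00:00 AM,2/20/2020 3:00:00 AM,Nestle\n"
)

if __name__ == "__main__":
    print(rewrite_header(sample))
```

Because the mapping is positional, anything downstream that looks columns up by the old names (for example "Campaign Start Date") would need to be updated.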
