
I have a requirement to interpret and parse a log file whose contents are delimited by commas. The contents of the log file look like below. There is no header row in the file; I am adding one here for convenience.

Event_TimeStamp, Target_IP, UserName, Source_IP, DatabaseName, Activity, QueryText,            Flag
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Connect,                       ,   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select * from table1',   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select * from table2',   0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select colname1,colname2,colname3,colname4 from table1',  0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       Query,   'select colname1,\'some Stringliteral\' as colname2,colname3,colname4 from table2',  0
202101xxxx,       10.2.x.x,     u1,   172.5.x.x,    DB1,       DisConnect,                    ,   0

I have used the ConvertRecord processor and configured a CSVRecordReader and a CSVRecordSetWriter to read/interpret the data and write it back to the flow file. The schema used by the CSV reader is as below.

{
  "type": "record",
  "namespace": "nifi",
  "name": "db",
  "fields": [
    { "name": "Event_Timestamp", "type": "string" },
    { "name": "SourceIP", "type": "string" },
    { "name": "UserName", "type": "string" },
    { "name": "TargetIP", "type": "string" },
    { "name": "DatabaseName", "type": "string" },
    { "name": "Activity", "type": "string" },
    { "name": "QueryText", "type": "string" },
    { "name": "Flag", "type": "int" }
  ]
}

The NiFi flow fails while reading the data because the delimiter configured on the CSVRecordReader is "," (comma) and the QueryText field also contains commas within its text. If I put only the first three data lines into a new log file and run the flow, the job succeeds, since those QueryText values contain no commas.

I do not know if the CSVReader is the right reader to use here. Can anyone help me read a log file that is comma-separated, but where one or a few fields contain commas within the text? Any help is highly appreciated.
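
To show the issue outside of NiFi, here is a small Python sketch (plain Python, not part of my flow) that reproduces the split problem on one of the rows above:

import csv
import io

# One of the problem rows: QueryText is single-quoted and contains commas
line = "202101xxxx,10.2.x.x,u1,172.5.x.x,DB1,Query,'select colname1,colname2,colname3,colname4 from table1',0"

# A naive split on "," ignores the quotes, so the row explodes into 11 fields instead of 8
print(len(line.split(",")))                        # 11

# A parser that only honours the default double-quote character does no better,
# because the field is quoted with ' rather than "
print(len(next(csv.reader(io.StringIO(line)))))    # 11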

QueryText field values are enclosed in single quotes ('), so configure your CSVRecordReader property Quote Character = ' (the default is "). - Vikramsinh Shinde
Also set the property CSV Parser = Jackson CSV - Vikramsinh Shinde
Hi Vikram, it worked like a charm. Thanks a lot. Since the QueryText field values are in single quotes ('), the solution has worked. If the QueryText field value were not enclosed in single quotes but still contained commas, is there a way to parse the CSV file in that case? - shankar
If the QueryText field value is not enclosed in any quote character, then the CSV parser cannot tell whether a comma is a delimiter or part of the data. In that case you would have to write a custom cleaning script before the ConvertRecord processor to escape the commas that belong to the data, but again, how would your custom script know whether a comma is a delimiter or part of the data? So the source that generates the CSV should adhere to at least a minimum set of rules of the CSV specification. - Vikramsinh Shinde
Thanks a lot Vikram. That answers my question. Appreciate your help. - shankar

1 Answer


QueryText field values are enclosed in single quotes ('), so configure your CSVRecordReader property Quote Character = ' (the default is ").

Also set the property CSV Parser = Jackson CSV
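
For anyone who wants to see what the quote character changes, here is a minimal sketch using Python's csv module as a stand-in for the Jackson CSV parser (an illustration only, not NiFi code):

import csv
import io

sample = (
    "202101xxxx,10.2.x.x,u1,172.5.x.x,DB1,Connect,,0\n"
    "202101xxxx,10.2.x.x,u1,172.5.x.x,DB1,Query,'select colname1,colname2,colname3,colname4 from table1',0\n"
)

# quotechar="'" mirrors setting Quote Character = ' on the reader:
# commas inside the single-quoted QueryText value no longer act as delimiters.
for row in csv.reader(io.StringIO(sample), delimiter=",", quotechar="'"):
    print(len(row), repr(row[6]))

# Output:
# 8 ''
# 8 'select colname1,colname2,colname3,colname4 from table1'

If the QueryText value were not quoted at all, no quote-character setting would help, which matches the point made in the comments above.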