Process non common csv file with awk using field patterns

Question

My bank sends a non common CSV file using ; as field separator and a binary code (hexadecimal a0 or octal 240) to enclose the fields where a ; could occur, as below:

Input

Extrait;Date;Date valeur;Compte;Description;Montant;Devise
�2020/0001/0002�;29.02.2020;29.02.2020;-;�28/02/20 Some shop in Antwerp     A Antwerpen (BE)�;-16,50;EUR
�2020/0001/0001�;01.02.2020;01.02.2020;-;�31/01/20 Some shop in Zaventem    Z Zaventem (BE)�;-13,00;EUR

I need to process fields 2, 5 and 6 with AWK.

Desired output

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

Up to now, as long as the fields enclosed by � do not contain any ; the script below using the variable FPAT works:

#!/usr/bin/awk -f
BEGIN { 
  FS=";"
  FPAT="[^;]*"                        # this works but not in all cases
  #FPAT="([^;]*)|(\240[^\240]+\240)"  # this doesn't work
}
{ gsub (/\240/, "", $5)               # I wish I could skip this instruction too
  print "{" $2 "}{" $5 "}{" $6 "}" 
}

I found a similar case (see awk FPAT to ignore commas in csv) but changing the , into ; and the \" into \240 didn't do the trick.

I need help for implementing a FPAT pattern to scan correctly my CSV file in all cases.

Note that the csv format isn't a standard and even if the comma as separator and the double quote as protection character are more usual, there's nothing wrong to use a semi-colon and the non-breakable space. Also, take care the non-breakable space, and probably all the file, is written with an ISO8859-1 encoding and not in UTF-8. — Casimir et Hippolyte
@CasimiretHippolyte: OK, I will edit my question in orde to change "non standard" into "non common". I don't know if the file is encoded UTF-8 or ISO8859-1, because I see no letters with accents. — Pierre François
@CasimiretHippolyte: indeed, I saw inside of another file that my bank is encoding according to ISO8859-1. If I convert the file to UTF-8, I get the sequence \xc2\xa0 instead of \xa0, which I can't use in the FPAT proposed by anubhava. I will have to find a workaround... — Pierre François
Nothing forbids to encode the result of anubhava script to UTF-8 after. — Casimir et Hippolyte
Also if you choose to convert your file before, you can change FPAT to [^;\xc2]+(\xc2[^\xa0][^;\xc2]*)*|(\xc2[^\xa0][^;\xc2]*)+ (without the typo) — Casimir et Hippolyte

anubhava anubhava · Accepted Answer · 2020-04-14T10:55:03

You may use this gnu awk with FPAT:

awk -v FPAT='[^;\xa0]+' '{printf "{%s}{%s}{%s}\n", $2, $5, $6}' file

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

-v FPAT='[^;\xa0]+' sets field pattern as 1+ of any character that is not ; and not \xa0.

Process non common csv file with awk using field patterns

1 Answers