2
votes

I currently have several csv files where i have no control over about how they are created. Needless to say... they are malformed and not compliant to RFC 4180.

Example input: gist

",0000000000000000";"0";"1115S021121-12-1/2"M"
",0000000000000000";"0";"1115S021122-12-1/2"M"
",0000000000000000";"0";"1115S021123-12-1/2"M"
",0000000000000000";"0";"1115S021124-12-1/2"M"
"1";"1";"EXAMPLE_RANDOM" .    STRING"
"2,0000000000000000";"2";"this;can"also happen"

Desired out:

",0000000000000000";"0";"1115S021121-12-1/2""M"

I have been trying to fix it by running sed over it with a regex. However i only have basic knowledge of regex and sed does not want to play nice with my attempts.

Could someone help me escape the inch quote " inside the double quotes? I know solutions like this are only 99%, i can only rely on the following facts.

  • Delimiter is ;
  • Enclosure is "
  • " can occur multiple times within the quoted text field.

This means a ; or " might occur within the quoted fields. Can someone help me replace the " with ""?

My attempt at regex combining several stackoverflow posts.

 sed -E "s/[^\"](?<!;)\"(?!;|$)/\1"/g" $filename.test2   -> error
 sed "s/[^\"](?<!;)(\")(?!;|$)/\1/g" $filename.test2    -> error 
 ... about 10 more variations, some even without errors but no replaced strings.

If someone has another solution other then regex, any help is much appreciated!

Edit: Thanks to @choroba the perl wizzard. The following fixes the file.

 cat $filename.test | perl -pe 's/(?<=[^;])"(?=[^;])/""/g' >  $filename.test2
2
I can parse ",0000000000000000";"0";"1115S021121-12-1/2""M" - Sam
Are there any double quotes that are already properly escaped in the input? - choroba
There are no double quotes in the input file for the moment. However if they should occur, they need to be escaped also. so thats 4 quotes total. """" - Sam
What about triple quotes? like in ",00000000"""00000000". What should happen for it? - revo
No idea why this was tagged go - Elias Van Ootegem

2 Answers

4
votes

Perl's look-around assertions to the resuce!

perl -pe 's/(?<=[^;])"(?=[^;\n])/""/g' 

I.e. if there's a " not preceded by a ; and not followed by a ;, replace it with "".

2
votes
$ perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>"test.csv",sep=>";",allow_loose_quotes=>1,allow_loose_escapes=>1),always_quote=>1)'
",0000000000000000","0","1115S021121-12-1/2""M"
",0000000000000000","0","1115S021122-12-1/2""M"
",0000000000000000","0","1115S021123-12-1/2""M"
",0000000000000000","0","1115S021124-12-1/2""M"
"1","1","EXAMPLE_RANDOM"" .    STRING"
"2,0000000000000000","2","this;can""also happen"

As choroba noted, with ";" as output sep too:

$ perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>"test.csv",sep=>";",allow_loose_quotes=>1,allow_loose_escapes=>1),always_quote=>1,sep=>";")'
",0000000000000000";"0";"1115S021121-12-1/2""M"
",0000000000000000";"0";"1115S021122-12-1/2""M"
",0000000000000000";"0";"1115S021123-12-1/2""M"
",0000000000000000";"0";"1115S021124-12-1/2""M"
"1";"1";"EXAMPLE_RANDOM"" .    STRING"
"2,0000000000000000";"2";"this;can""also happen"

Explanation added by request:

Text::CSV_XS is a perl module that can parse and generate CSV in a very flexible manner. Using options/attributes to allow badly formatted CSV.

  1. csv (in => "file.csv", ...) reads a file into an internal structure

  2. sep => ";" sets the separator character to ";" instead of the default ","

  3. allow_loose_quotes => 1 and allow_loose_escapes => 1 make it possible to read the incorrect CSV and accept the unescaped nested quotes

  4. csv () returns the reference to the internal structure, which is valid for the outer csv call that generates the output csv (in => csv (in => "file.csv")

  5. The last two arguments set sep to ";" for the output too and cause all field to be quoted, as the OP required

See https://metacpan.org/module/Text::CSV_XS for all options and examples

Formatted in a script, it might look like

use Text::CSV_XS qw( csv );

csv (                    # Outer function
    always_quote => 1,   # Quote all field
    sep          => ";", # Use ; instead of ,
    in           =>      # Input
        csv (            #   comes from inner function
            in                  => "test.csv", # a file
            sep                 => ";",        # ; instead of ,
            allow_loose_quotes  => 1,          # allow ,"foo"bar",
            allow_loose_escapes => 1,          # idem
            )
    );