I currently have several CSV files and no control over how they are created. Needless to say, they are malformed and not compliant with RFC 4180.
Example input:
",0000000000000000";"0";"1115S021121-12-1/2"M"
",0000000000000000";"0";"1115S021122-12-1/2"M"
",0000000000000000";"0";"1115S021123-12-1/2"M"
",0000000000000000";"0";"1115S021124-12-1/2"M"
"1";"1";"EXAMPLE_RANDOM" . STRING"
"2,0000000000000000";"2";"this;can"also happen"
Desired output:
",0000000000000000";"0";"1115S021121-12-1/2""M"
I have been trying to fix this by running sed over the files with a regex. However, I only have basic knowledge of regex, and sed does not want to play nice with my attempts.
Could someone help me escape the inch mark " inside the double-quoted fields? I know solutions like this are only about 99% reliable, since I can only rely on the following facts:
- Delimiter is ;
- Enclosure is "
- " can occur multiple times within the quoted text field.
This means a ; or " might occur within the quoted fields. Can someone help me replace each " inside a field with ""?
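To make the rule above concrete, here is a minimal Python sketch of one way to read it. It assumes that a quote is a field boundary only when it sits at the start or end of the line or directly next to a `;`, and that every other quote is an interior one to be doubled. The name `escape_interior_quotes` is made up for illustration.

```python
import re

# Assumption: a quote is "interior" when it has a character on both sides
# and neither of those characters is the ';' delimiter.
INTERIOR_QUOTE = re.compile(r'(?<=[^;])"(?=[^;])')

def escape_interior_quotes(line: str) -> str:
    """Double every interior quote in one line of the malformed CSV."""
    # Set the line ending aside so the closing quote of the last field
    # is not followed by a newline character and mistaken for interior.
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return INTERIOR_QUOTE.sub('""', body) + ending
```

This shares the 99% caveat: a `";` sequence inside a field is indistinguishable from a field boundary, and running it over a file that already contains `""` would double those quotes again.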
My attempts at a regex, combining several Stack Overflow posts:
sed -E "s/[^\"](?<!;)\"(?!;|$)/\1"/g" $filename.test2 -> error
sed "s/[^\"](?<!;)(\")(?!;|$)/\1/g" $filename.test2 -> error
... about 10 more variations, some even without errors but no replaced strings.
If someone has a solution other than regex, any help is much appreciated!
Edit: Thanks to @choroba, the Perl wizard. The following fixes the file:
perl -lpe 's/(?<=[^;])"(?=[^;])/""/g' "$filename.test" > "$filename.test2"
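A quick way to check the result is to feed it to a strict CSV parser. The sketch below uses Python's standard `csv` module with `;` as the delimiter and `strict=True`, so a leftover unescaped quote raises `csv.Error` instead of being silently accepted. The helper name `parse_fields` is hypothetical.

```python
import csv
import io

def parse_fields(text: str) -> list:
    """Parse ';'-delimited, double-quoted CSV text into a list of rows."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=";",
        quotechar='"',
        doublequote=True,  # "" inside a quoted field means one literal "
        strict=True,       # raise csv.Error on malformed quoting
    )
    return list(reader)
```

For the desired output line shown earlier, this should yield three fields, with the third containing a single `"` before the `M`.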
",00000000"""00000000". What should happen for it? - revogo- Elias Van Ootegem