Facing issues with properly formatting CSV data

Question

Before I begin my question and background information, I'd like to state that I realize that many people have asked a similar question, but none of the answers to their questions have applied to my situation.

Background info: I'm trying to properly format a very large CSV file so that I can import it into a table in my PostgreSQL database. This CSV file only contains two fields, and the delimiter is ;

Problems encountered/attempted solutions

Problem #1: The delimiter is a semicolon, and many of the values in one of the fields contain semicolons. PostgreSQL obviously doesn't like this.

Solution #1: I used sed to change the delimiter to a string of characters that I knew would only occur as a delimiter.

Problem #2: The delimiter can only be a single character.

Solution #2: I changed the delimiter to a unicode character that I knew wouldn't occur as anything other than a delimiter.

Problem #3: The delimiter can only be a single-byte character.

Solution #3: I decided to go back in my steps, and rather than mess with the delimiter, I tried using sed to enclose all field values in double quotes in order to avoid the problem of some of the values containing the delimiter character. More specifically, I tried using the command found in the answer to the question "sed statement to change/modify CSV separators and delimiters".

Problem #4: This resulted in many data errors. The command treated every semicolon as a delimiter, so any value that contained a semicolon was split into several quoted pieces, and PostgreSQL then tried to copy values that were far too long or were not individual values at all. This row is a perfect example:

    "m[redacted]@[redacted].com";"mk,l.";"/'"

This row in particular made PostgreSQL think that it was copying 3 columns. Not to mention this row -

    "[redacted]&#39";"of&#39";"all&#39";"your&#39";"[redacted]@[redacted].com";"[redacted]@[redacted].com:hapa[redacted]hoha"

Which made PostgreSQL attempt to copy the entire rest of the file into the second field as a single value.

Question

With all of that having been said, my final question is this - how can I enclose every value in the CSV file in double quotes in such a way that it will be properly imported into PostgreSQL?

Right now I'm backed against a wall and would appreciate any advice, even if it isn't a clear answer. I've tried everything I can think of. If an answer is even possible, I'd like one that can apply to CSV files that contain more than two fields, as I have many more CSV files to import after this one.

Wayne Niddery · Accepted Answer · 2018-07-20T02:27:37

You state that one of the two fields can contain semicolons. If the other field never contains any, then the semicolon next to that other field is the real delimiter. If the field containing semicolons as part of its data comes first, you need to find the last semicolon on the line; otherwise, you need the first.

I've never used sed, but regex allows you to match the first or last occurrence of a character, so you can replace that single semicolon with a temporary character or pattern. You should then be able to place quotes around the fields and finally change the temporary field delimiter back.
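The same idea can be sketched in Python instead of sed. This is a minimal illustration, not a tested import pipeline. It uses `str.partition` and `str.rpartition` in place of a regex and a temporary delimiter, and it assumes exactly two fields per line, no line breaks inside a value, and that only one of the two fields can contain semicolons. The names `split_two_fields`, `requote`, and `semicolons_in` are made up for this example.

```python
import csv
import io


def split_two_fields(line, semicolons_in="second"):
    """Split one raw line into two fields on a single semicolon.

    semicolons_in says which field may contain semicolons as data:
    "second" -> the first semicolon on the line is the delimiter,
    "first"  -> the last semicolon on the line is the delimiter.
    """
    line = line.rstrip("\r\n")
    if semicolons_in == "second":
        left, sep, right = line.partition(";")
    else:
        left, sep, right = line.rpartition(";")
    if not sep:
        raise ValueError(f"no delimiter found in line: {line!r}")
    return [left, right]


def requote(src, dst, semicolons_in="second"):
    """Read raw lines from src and write fully quoted rows to dst."""
    # QUOTE_ALL wraps every field in double quotes; the csv module doubles
    # any double quote inside a value, which is also the default escaping
    # in PostgreSQL's CSV format.
    writer = csv.writer(
        dst, delimiter=";", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    for line in src:
        if line.strip():  # skip blank lines
            writer.writerow(split_two_fields(line, semicolons_in))


# Small in-memory demonstration with invented data.
raw = io.StringIO('a@b.com;pass;word\nc@d.com;say "hi"\n')
out = io.StringIO()
requote(raw, out)
print(out.getvalue(), end="")
```

For a real file, `src` and `dst` would be file objects opened with `newline=""`, and the result should then be loadable with `COPY` in CSV format using `;` as the delimiter. Files with more than two fields need a different rule for locating the real delimiters, since this sketch relies on there being exactly one.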
