Update: In PowerShell [Core] v7+, a much simpler solution is possible: see this answer.
The existing answers work well with the sample input:
- Wiktor Stribiżew's helpful answer, which uses a regex to identify double-quoted fields that do not contain `,`, loads the entire input file into memory first, which enables replacing the input file with the results in a single pipeline.
While this is convenient - and faster than row-by-row processing - the caveat is that it may not be an option for large input files; a sketch of this approach follows this list.
- markg's helpful answer, which splits rows into fields by `"` chars., is an alternative for large input files, because it uses the pipeline to process the input rows one by one.
(As a consequence, the input file cannot be directly updated with the result.)
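A minimal sketch of such a whole-file approach, assuming that no fields contain embedded `""` escapes (the exact regex in the linked answer may differ):

# Whole-file variant: unquote all double-quoted fields that contain
# neither ',' nor '"' (character class [^",] excludes both).
(Get-Content ./test.csv) -replace '"([^",]*)"', '$1' |
  Set-Content ./test.csv

Enclosing the Get-Content call in (...) forces all lines to be read into memory up front, which is what makes it safe to write back to the same file in the same pipeline.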
If we generalize the OP's requirement to also handle fields with embedded `"` chars., we need a different approach:
The following fields must then retain their enclosing double quotes:
- (of necessity) double-quoted fields with embedded `,` chars.; e.g., `"1234 Main St, New York, NY"`
- (of necessity) double-quoted fields with embedded `"` chars., which, per RFC 4180, must be escaped as `""`, i.e., doubled; e.g., `"Nat ""King"" Cole"`
Note:
- We are not dealing with fields that may contain embedded line breaks, as that would require a fundamentally different approach, because self-contained line-by-line processing is then no longer possible.
- Tip of the hat to Wiktor Stribiżew, who came up with the regex to robustly match a double-quoted field with an arbitrary number of embedded double quotes, escaped as `""`: `"([^"]*(?:""[^"]*)*)"`
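As a quick standalone demonstration of that regex (with a made-up input line), note that a field with embedded `""` escapes is matched as a single unit:

# Illustrative only: the whole field, including the embedded "" escapes,
# is returned as one match.
[regex]::Matches('nat,"Nat ""King"" Cole",x', '"([^"]*(?:""[^"]*)*)"').Value
# -> "Nat ""King"" Cole"

With that in place, the full solution: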
# Create sample CSV file with double-quoted fields that contain
# just ',', just embedded double quotes ('""'), and both.
@'
bob,"1234 Main St, New York, NY","cool guy"
nat,"Nat ""King"" Cole Lane","cool singer"
nat2,"Nat ""King"" Cole Lane, NY","cool singer"
'@ | Set-Content ./test.csv

Get-Content ./test.csv | ForEach-Object {
  # Match all double-quoted fields on the line, and replace those that
  # contain neither commas nor embedded double quotes with just their content,
  # i.e., with the enclosing double quotes removed.
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, { param($match)
      $fieldContent = $match.Groups[1]
      if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
    })
}
This yields:
bob,"1234 Main St, New York, NY",cool guy
nat,"Nat ""King"" Cole Lane",cool singer
nat2,"Nat ""King"" Cole Lane, NY",cool singer
Updating the input file:
As in markg's answer, due to line-by-line processing, you cannot directly update the input file with the output in the same pipeline.
To update the input file later, use a temporary output file and then replace the input file with it (... represents the Get-Content pipeline from above, only with $csvFile instead of ./test.csv):
$csvFile = 'c:\path\to\some.csv'
$tmpFile = "$env:TEMP\tmp.$PID.csv"
... | Set-Content $tmpFile
if ($?) { Move-Item -Force $tmpFile $csvFile }
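Put together - assembled purely from the pieces above - the full update looks like this:

$csvFile = 'c:\path\to\some.csv'
$tmpFile = "$env:TEMP\tmp.$PID.csv"
# Unquote eligible fields row by row and write to the temporary file.
Get-Content $csvFile | ForEach-Object {
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, { param($match)
      $fieldContent = $match.Groups[1]
      if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
    })
} | Set-Content $tmpFile
# Only replace the input file if the pipeline succeeded.
if ($?) { Move-Item -Force $tmpFile $csvFile }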
Note that in Windows PowerShell, Set-Content uses your system's single-byte, extended-ASCII character encoding by default (even though the help topic incorrectly states ASCII).
Using the -Encoding parameter allows you to specify a different encoding, but note that UTF-16LE, which is the default for Out-File / >, causes the CSV file not to be recognized properly by Excel, for instance.
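For example, to write UTF-8 instead - which Set-Content in Windows PowerShell invariably writes with a BOM, an encoding Excel does recognize:

# Same pipeline as above, but with an explicit output encoding.
... | Set-Content -Encoding Utf8 $tmpFile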