Update: In PowerShell [Core] v7+, a much simpler solution is possible: see this answer.
The existing answers work well with the sample input:
- Wiktor Stribiżew's helpful answer, which uses a regex to identify double-quoted fields that do not contain `,`, loads the entire input file into memory first, which enables replacing the input file with the results in a single pipeline.
While this is convenient - and faster than row-by-row processing - the caveat is that it may not be an option for large input files; a sketch of this approach follows this list.
- markg's helpful answer, which splits rows into fields by `"` chars., is an alternative for large input files, because it uses the pipeline to process the input rows one by one.
(As a consequence, the input file cannot be directly updated with the result.)
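A minimal sketch of such a whole-file approach, assuming that no fields contain embedded `""` escapes (the exact regex in the linked answer may differ):

# Whole-file variant: unquote all double-quoted fields that contain
# neither ',' nor '"' (character class [^",] excludes both).
(Get-Content ./test.csv) -replace '"([^",]*)"', '$1' |
  Set-Content ./test.csv

Enclosing the Get-Content call in (...) forces all lines to be read into memory up front, which is what makes it safe to write back to the same file in the same pipeline.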
If we generalize the OP's requirement to also handle fields with embedded `"` chars., we need a different approach:
The following fields must then retain their enclosing double quotes:
- (of necessity) double-quoted fields with embedded `,` chars.; e.g., `"1234 Main St, New York, NY"`
- (of necessity) double-quoted fields with embedded `"` chars., which, per RFC 4180, must be escaped as `""`, i.e., doubled; e.g., `"Nat ""King"" Cole"`
Note:
- We are not dealing with fields that may contain embedded line breaks, as that would require a fundamentally different approach, because self-contained line-by-line processing is then no longer possible.
- Tip of the hat to Wiktor Stribiżew, who came up with the regex to robustly match a double-quoted field with an arbitrary number of embedded double quotes, escaped as `""`: `"([^"]*(?:""[^"]*)*)"`
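As a quick standalone demonstration of that regex (with a made-up input line), note that a field with embedded `""` escapes is matched as a single unit:

# Illustrative only: the whole field, including the embedded "" escapes,
# is returned as one match.
[regex]::Matches('nat,"Nat ""King"" Cole",x', '"([^"]*(?:""[^"]*)*)"').Value
# -> "Nat ""King"" Cole"

With that in place, the full solution: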
# Create sample CSV file with double-quoted fields that contain
# just ',', just embedded double quotes ('""'), and both.
@'
bob,"1234 Main St, New York, NY","cool guy"
nat,"Nat ""King"" Cole Lane","cool singer"
nat2,"Nat ""King"" Cole Lane, NY","cool singer"
'@ | Set-Content ./test.csv

Get-Content ./test.csv | ForEach-Object {
  # Match all double-quoted fields on the line, and replace those that
  # contain neither commas nor embedded double quotes with just their content,
  # i.e., with the enclosing double quotes removed.
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, { param($match)
      $fieldContent = $match.Groups[1]
      if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
    })
}
This yields:
bob,"1234 Main St, New York, NY",cool guy
nat,"Nat ""King"" Cole Lane",cool singer
nat2,"Nat ""King"" Cole Lane, NY",cool singer
Updating the input file:
As in markg's answer, due to line-by-line processing, you cannot directly update the input file with the output in the same pipeline.
To update the input file later, use a temporary output file and then replace the input file with it (... represents the Get-Content pipeline from above, only with $csvFile instead of ./test.csv):
$csvFile = 'c:\path\to\some.csv'
$tmpFile = "$env:TEMP\tmp.$PID.csv"
... | Set-Content $tmpFile
if ($?) { Move-Item -Force $tmpFile $csvFile }
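Put together - assembled purely from the pieces above - the full update looks like this:

$csvFile = 'c:\path\to\some.csv'
$tmpFile = "$env:TEMP\tmp.$PID.csv"
# Unquote eligible fields row by row and write to the temporary file.
Get-Content $csvFile | ForEach-Object {
  ([regex] '"([^"]*(?:""[^"]*)*)"').Replace($_, { param($match)
      $fieldContent = $match.Groups[1]
      if ($fieldContent -match '[,"]') { $match } else { $fieldContent }
    })
} | Set-Content $tmpFile
# Only replace the input file if the pipeline succeeded.
if ($?) { Move-Item -Force $tmpFile $csvFile }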
Note that in Windows PowerShell, Set-Content uses your system's single-byte, extended-ASCII character encoding by default (even though the help topic incorrectly states ASCII).
Using the -Encoding parameter allows you to specify a different encoding, but note that UTF-16LE, which is the default for Out-File / >, causes the CSV file not to be recognized properly by Excel, for instance.
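For example, to write UTF-8 instead - which Set-Content in Windows PowerShell invariably writes with a BOM, an encoding Excel does recognize:

# Same pipeline as above, but with an explicit output encoding.
... | Set-Content -Encoding Utf8 $tmpFile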