Read UTF-8 files correctly with PowerShell

Question

The situation is as follows:

A PowerShell script creates a file with UTF-8 encoding
The user may or may not edit the file. Editing might strip the BOM and change the line separators, but the encoding should remain UTF-8
The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
This can be iterated many times

With Get-Content and Out-File -Encoding UTF8 I cannot read the file back correctly. Get-Content stumbles over the BOM the script wrote earlier (it ends up in the content and breaks my parsing regex), does not decode the file as UTF-8, and even drops line breaks from the original content.

I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?

Update

I have added a little test script that shows what I'm trying to do and what happens instead.

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt
    if ($data -match "^[0-9-]{10} - r([0-9]+)")
    {
        $startRev = [int]$matches[1] + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
    "Line 1`r`n" + `
    "Line 2`r`n`r`n"

# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8

After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).

Instead, the second run gives me an error.

I'm not great with the whole encoding topic, but wouldn't you have to re-inject the BOM, if it gets removed, in order to read it properly? I'm a little confused by the question. Why do you want to remove the UTF-8 BOM? — Trevor Sullivan
My text editor is stupid and removes it. Anyway you never know what text editors do with UTF-8 files. My script should simply be smart enough to handle it. Like the StreamReader class does it pretty well. — ygoe

JPBlanc · Accepted Answer · 2014-04-01T16:20:34

If the file is supposed to be UTF-8, why not tell Get-Content to decode it as UTF-8:

Get-Content -Path test.txt -Encoding UTF8
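The question also asks for the BOM to be dropped and the line breaks to survive a read/write round trip. As a rough sketch only, and not part of the original answer, the .NET file methods can be wrapped in two small helpers. The names Read-Utf8Text and Write-Utf8Text and the temp-file path are hypothetical, chosen for illustration; the sketch assumes the whole file fits comfortably in memory.

```powershell
# Hypothetical helpers for illustration; the names are not from the original thread.
function Read-Utf8Text {
    param([Parameter(Mandatory = $true)][string]$Path)
    # .NET resolves relative paths against the process working directory,
    # which may differ from PowerShell's current location, so resolve first.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
    # ReadAllText decodes as UTF-8, consumes a leading BOM if present,
    # and returns one string with the original line breaks intact.
    return [System.IO.File]::ReadAllText($fullPath, [System.Text.Encoding]::UTF8)
}

function Write-Utf8Text {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][AllowEmptyString()][string]$Text
    )
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
    # $false means "do not emit a BOM"; pass $true instead to write one.
    $encoding = New-Object System.Text.UTF8Encoding($false)
    # WriteAllText writes the string as-is and appends no trailing newline.
    [System.IO.File]::WriteAllText($fullPath, $Text, $encoding)
}

# Usage: write a first section, read it back, prepend a newer section.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8-demo.txt'
Write-Utf8Text -Path $file -Text "2014-04-01 - r11`r`n`r`nLine 1`r`n"

$data = Read-Utf8Text -Path $file
$rev = 0
# On a single string, -match returns a Boolean and fills $Matches.
if ($data -match '^[0-9-]{10} - r([0-9]+)') { $rev = [int]$Matches[1] }

$data = "2014-04-02 - r$($rev + 1)`r`n`r`n" + $data
Write-Utf8Text -Path $file -Text $data
```

Reading the file as one string matters for the test script in the question: Get-Content without -Raw returns an array of lines, so -match filters that array instead of filling $Matches, and concatenating a string with the array joins the lines with spaces rather than line breaks. On PowerShell 3.0 or later, Get-Content -Raw -Encoding UTF8 should be a comparable way to get the whole file as a single string.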
