14
votes

Following situation:

  • A PowerShell script creates a file with UTF-8 encoding
  • The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, and possibly changing the line separators
  • The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
  • This can be iterated many times

With Get-Content and Out-File -Encoding UTF8 I have problems reading it correctly. It's stumbling over the BOM it has written before (putting it in the content, breaking my parsing regex), does not use UTF-8 encoding and even deletes line breaks in the original content part.

I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?

Update

I have added a little test script that shows what I'm trying to do and what happens instead.

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt
    if ($data -match "^[0-9-]{10} - r([0-9]+)")
    {
        $startRev = [int]$matches[1] + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
    "Line 1`r`n" + `
    "Line 2`r`n`r`n"

# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8

After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).

Instead, the second run gives me an error.

3
I'm not great with the whole encoding topic, but wouldn't you have to re-inject the BOM, if it gets removed, in order to read it properly? I'm a little confused by the question. Why do you want to remove the UTF-8 BOM?Trevor Sullivan
My text editor is stupid and removes it. Anyway you never know what text editors do with UTF-8 files. My script should simply be smart enough to handle it. Like the StreamReader class does it pretty well.ygoe

3 Answers

28
votes

If the file is supposed to be UTF8 why don't you try to read it decoding UTF8 :

Get-Content -Path test.txt -Encoding UTF8
4
votes

Really JPBlanc is right. If you want it read as UTF8 then specify that when the file is read.

On a side note, you're losing formatting in here with the [String]+[String] stuff. Not to mention your regex match doesn't work. Check out the regex search changes, and the changes made to the $newMsgs, and the way I'm outputting your data to the file.

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt #-Encoding UTF8
    if($data -match "\br([0-9]+)\b"){
        $startRev = [int]([regex]::Match($data,"\br([0-9]+)\b")).groups[1].value + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = @"
2014-04-01 - r$startRev`r`n`r`n
    Line 1`r`n
    Line 2`r`n`r`n
"@

# Write new data back
$newmsgs,$data | Out-File test.txt -Encoding UTF8
2
votes

Get-Content doesn't seem to handle UTF-files without BOM at all (if you omit the Encoding-flag). System.IO.File.ReadLines seems to be an alternative, examples:

PS C:\temp\powershellutf8> $a = Get-Content .\utf8wobom.txt
PS C:\temp\powershellutf8> $b = Get-Content .\utf8wbom.txt
PS C:\temp\powershellutf8> $a2 = Get-Content .\utf8wbom.txt -Encoding UTF8
PS C:\temp\powershellutf8> $a
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ  <== This doesnt seem to be right at all
PS C:\temp\powershellutf8> $b
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $a2
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8>
PS C:\temp\powershellutf8> $c = [IO.File]::ReadLines('.\utf8wbom.txt');
PS C:\temp\powershellutf8> $c
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $d = [IO.File]::ReadLines('.\utf8wobom.txt');
PS C:\temp\powershellutf8> $d
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ <== Works!