
We receive extracts from a different system, and their encoding keeps changing. We need to keep the encoding consistent internally, so I wrote a PowerShell script to convert the files to UTF-8. However, the accented characters get changed: the name Denaè becomes DenaÃ¨. I would like to retain the name as Denaè. Any help would be much appreciated.

I would like to change the file to UTF-8 with the accented characters unchanged, using PowerShell. Is it possible?

vonPryz

Here is the code that I have right now:

$Source = 'C:\Source'
$Destination = 'C:\Destination'
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)

Remove-Item $Destination -Recurse -Force

foreach ($i in Get-ChildItem $Source -Recurse -Force) {
    if ($i.PSIsContainer) {
        continue
    }

    $path = $i.DirectoryName.Replace($Source, $Destination)
    $name = $i.FullName.Replace($Source, $Destination)

    if ( !(Test-Path $path) ) {
        New-Item -Path $path -ItemType directory
    }

    $content = Get-Content $i.FullName

    if ( $content -ne $null ) {
        [System.IO.File]::WriteAllLines($name, $content, $Utf8NoBomEncoding)
    } else {
        Write-Host "No content from: $i"
    }
}

The substitution è -> Ã¨ looks like the file is converted into plain text, so it's no longer Unicode. Show the relevant parts of code that processes the file with a minimal reproducible example. - vonPryz
PowerShell 5.1's Get-Content won't detect UTF-8 without a BOM unless you pass the "-Encoding UTF8" parameter. - js2010

1 Answer


The characters Ã¨ are what the UTF-8 bytes of è (Latin Small Letter E With Grave, code point U+00E8) look like when they are decoded as Windows-1252.

Proof:

[System.Text.Encoding]::UTF8.GetBytes([char]'è') -join ', '
# 195, 168

[System.Text.Encoding]::GetEncoding(1252).GetBytes([char[]]'Ã¨') -join ', '
# 195, 168

[char[]][System.Text.Encoding]::UTF8.GetBytes([char]'è') -join ''
# Ã¨
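
Since Ã¨ is just the UTF-8 bytes of è decoded with the wrong code page, the input was most likely already UTF-8 without a BOM, and Get-Content in Windows PowerShell 5.1 read it as ANSI (as js2010's comment notes). Here is a minimal sketch of one way around that, assuming each extract is either UTF-8 or Windows-1252 (UTF-16 input is not handled). It decodes the raw bytes as strict UTF-8 and falls back to code page 1252 only when that fails. The function name Convert-FileToUtf8NoBom and its parameters are hypothetical, not part of the original script.

```powershell
# Hypothetical helper (not from the original post): decode the file as strict
# UTF-8 first and fall back to a legacy code page only if the bytes are invalid.
function Convert-FileToUtf8NoBom {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $OutPath,
        # Assumption: extracts that are not UTF-8 are Windows-1252.
        [System.Text.Encoding] $Fallback = [System.Text.Encoding]::GetEncoding(1252)
    )

    $bytes = [System.IO.File]::ReadAllBytes($Path)

    # UTF8Encoding($false, $true): emit no BOM, throw on invalid input bytes.
    $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
    try {
        $text = $strictUtf8.GetString($bytes)
    } catch {
        # Not valid UTF-8, so treat the bytes as the legacy code page.
        $text = $Fallback.GetString($bytes)
    }

    # GetString keeps a leading BOM as U+FEFF, so drop it if present.
    $text = $text.TrimStart([char]0xFEFF)

    # Write the text back out as UTF-8 without a BOM.
    [System.IO.File]::WriteAllText($OutPath, $text, $strictUtf8)
}
```

In the loop from the question, the Get-Content / WriteAllLines pair could then be replaced with a call like Convert-FileToUtf8NoBom -Path $i.FullName -OutPath $name. If every extract is known to be UTF-8, adding -Encoding UTF8 to the existing Get-Content call is the simpler change. Pass full paths, since the .NET file APIs resolve relative paths against the process working directory rather than the current PowerShell location.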