3
votes

I'm using the following PowerShell script to convert a string to XML and then export it to a file (done this way to keep the indenting):

[xml]$xmloutput = $xml
$sw = New-Object System.IO.StringWriter
$writer = New-Object System.Xml.XmlTextWriter($sw)
$writer.Formatting = [System.Xml.Formatting]::Indented
$xmloutput.WriteContentTo($writer)
$sw.ToString() | Set-Content -Encoding 'ASCII' $filepath

The destination has to be ASCII-encoded due to a vendor restriction. The issue I'm seeing is that ASCII just changes special characters into question marks (example: Ö becomes ?).

If I use UTF8 encoding, the output looks totally fine. I've even tried saving as UTF8 and then converting to ASCII, but that does the same thing (exports a question mark):

[System.Io.File]::ReadAllText($filepath) | Out-File -FilePath $filepath -Encoding ASCII

If I try to replace the characters in the string before the conversion to XML (using the character code &#214;), it simply escapes the ampersand and leaves the rest, making it useless.

Is there any way to have PowerShell correctly save those characters into the file?

EDIT: I would like to see the special character in the output file, but if that is not ASCII-compliant, I'd like to see the character code for it (in this example, &#214;).

I also don't want to see just an O, I need the actual character.

2
Which ASCII character are you expecting to see? – Josh Lee
@JoshLee - I would like to see just the Ö character, but if that's not ASCII compliant, I'd expect to see &#214; – chazbot7
That is an extended ASCII character so I would think it would work. I don't know that much about PowerShell and how it handles codepages. – Squashman
Try using -Encoding Default or -Encoding OEM. Per this MS document: docs.microsoft.com/en-us/powershell/module/… – Squashman
What happens when you set the StringWriter.Encoding property to ASCII as well? – iRon

2 Answers

6
votes

All characters in an XML document are Unicode. However, a serialized representation of an XML document has a document encoding. Characters that are not members of that encoding's character set are written as numeric character references, often in hexadecimal notation. The number is the Unicode codepoint.

It seems your partner's requirement is to use ASCII as the document encoding.
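To illustrate why the question's approach yields question marks: a plain ASCII text encoder knows nothing about XML, so it substitutes `?` for anything it cannot represent. The snippet below is only a sketch of that behaviour using .NET's ASCII encoding directly; the variable names are mine, and `[char]0xD6` is used instead of a literal Ö so the result does not depend on how the script file itself is saved.

```powershell
# Build the sample string without a non-ASCII literal in the script source.
$text = 'hell' + [char]0xD6          # 'hellÖ'

# Encode to ASCII bytes; Ö is not representable, so the default
# replacement fallback emits '?' (byte 0x3F) in its place.
$bytes = [System.Text.Encoding]::ASCII.GetBytes($text)

# Decode again to see what actually ended up in the byte stream.
$roundTrip = [System.Text.Encoding]::ASCII.GetString($bytes)
$roundTrip   # hell?
```

The information is lost at the encoding step, which is why the escaping has to be done by something that understands XML (or by replacing the characters before encoding).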

XmlDocument is a bit hard to work with but an XmlWriter with settings for the document encoding will work:

$myString = 'hellÖ'

[xml]$myXml = [System.Management.Automation.PSSerializer]::Serialize($myString)

$settings = New-Object System.Xml.XmlWriterSettings
$settings.Encoding = [System.Text.Encoding]::ASCII
$settings.Indent = $true

$writer = [System.Xml.XmlWriter]::Create("./test.xml", $settings)
$myXml.Save($writer)
$writer.Dispose()

This writes an ASCII-encoded text file with an XML declaration stating that the document encoding is ASCII, and it uses hexadecimal numeric character references for content characters that can't be represented in ASCII:

<?xml version="1.0" encoding="us-ascii"?>
<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
  <S>hell&#xD6;</S>
</Objs>

As you can see here in the C1 Controls and Latin-1 Supplement block, U+00D6 (&#xD6;) is Ö, LATIN CAPITAL LETTER O WITH DIAERESIS.
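If you want to convince yourself that nothing is lost, you can read the file back and compare. This is a rough sketch rather than part of the solution: the temp-file name and the tiny `<S>` document are made up for the check, and an absolute path is used because .NET's working directory can differ from PowerShell's current location.

```powershell
# Hypothetical scratch file in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'ascii-roundtrip-test.xml'

# Small stand-in document containing Ö (U+00D6).
$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml('<S>hell' + [char]0xD6 + '</S>')

# Same writer settings as above: ASCII document encoding, indented output.
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Encoding = [System.Text.Encoding]::ASCII
$settings.Indent = $true

$writer = [System.Xml.XmlWriter]::Create($path, $settings)
try { $doc.Save($writer) } finally { $writer.Dispose() }

# The raw file text should contain the character reference, not a '?'.
$raw = [System.IO.File]::ReadAllText($path, [System.Text.Encoding]::ASCII)

# Parsing the file again should give back the original character.
$reloaded = New-Object System.Xml.XmlDocument
$reloaded.Load($path)
$roundTripped = $reloaded.DocumentElement.InnerText

Remove-Item -LiteralPath $path
```

Any conforming XML parser on the vendor's side should resolve the reference the same way.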

2
votes

This isn't really specific to PowerShell, it's a character encoding issue in general.

Basically, Ö is not an ASCII character; it belongs to ISO 8859-1 (Latin-1), a single-byte superset of ASCII.

But also, this process can be simplified by having the XmlTextWriter write directly to the file, since you can control the encoding with it. Try this:

$myString = 'hellÖ'

[xml]$myXml = [System.Management.Automation.PSSerializer]::Serialize($myString)

$myEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')

$writer = New-Object System.Xml.XmlTextWriter($filepath, $myEncoding)
$writer.Formatting = [System.Xml.Formatting]::Indented

$myXml.WriteContentTo($writer)

$writer.Flush()
$writer.Close()
$writer.Dispose()

This will write the file with the ISO 8859-1 encoding, but it will not encode into XML entities.

So if your application needs true ASCII only, no extended sets, then this won't work. If it really just needs single-byte encoding and the set of characters within this encoding is sufficient, then it's fine.
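To make the difference concrete, here is a small sketch comparing the bytes the two encodings produce for Ö. The variable names are only for illustration.

```powershell
# Ö as a one-character string, built from its code point U+00D6.
$oUmlaut = [string][char]0xD6

# ISO 8859-1 maps U+00D6 directly to the single byte 0xD6.
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$latin1Bytes = $latin1.GetBytes($oUmlaut)

# ASCII has no such character, so the encoder substitutes '?' (0x3F).
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($oUmlaut)

'{0:X2}' -f $latin1Bytes[0]   # D6
'{0:X2}' -f $asciiBytes[0]    # 3F
```

A byte value of 0xD6 is above 0x7F, so a consumer that strictly validates 7-bit ASCII would reject the ISO 8859-1 file.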


How to do it with entities:

Step 1: ignore what I wrote and use Tom Blodget's answer instead.

What you could do is set a custom fallback callback on the ASCII encoder, so that whenever it encounters a character that can't be represented in ASCII, it calls your function to get a replacement. Your function would helpfully just return the entity version of the character.

Technically, that could backfire. Since you must return the ampersand & from the encoder, the XmlWriter might see that and "helpfully" replace it with &amp;, which would ruin your encoding.

Using this callback directly from PowerShell might be possible, but will be a bit cumbersome. It would be easier with some C# and Add-Type.

Or you could do a guerrilla version of this method: write your XML string, then manually replace any characters that aren't ASCII.

Here I'm using a version of the regex engine's replace method that takes a function for match evaluation. The regex just matches any character that is not in the 'BasicLatin' Unicode Named Block.

$myString = 'hellÖ'

[xml]$myXml = [System.Management.Automation.PSSerializer]::Serialize($myString)

$sw = New-Object System.IO.StringWriter
$writer = New-Object System.Xml.XmlTextWriter($sw)
$writer.Formatting = [System.Xml.Formatting]::Indented
$myXml.WriteContentTo($writer)

$output = [RegEx]::Replace($sw.ToString(), '\P{IsBasicLatin}', { param($match) '&#{0};' -f [int][char]$match.Value })
$output | Set-Content -Encoding 'ASCII' -LiteralPath $filepath

As far as I can tell this will do exactly what you want.
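One caveat with the regex approach: .NET strings are UTF-16, so a character outside the Basic Multilingual Plane (an emoji, for example) is stored as two surrogate code units, and a per-code-unit replacement would write each half as its own reference, which is not valid XML. The following is a sketch of a variant that matches a surrogate pair as one unit first; the pattern and the sample string are my own additions, not something the question needs if its data stays within the BMP.

```powershell
# Sample text: Ö (U+00D6) plus one character outside the BMP (U+1F600).
$text = 'hell' + [char]0xD6 + ' ' + [char]::ConvertFromUtf32(0x1F600)

# Try a full surrogate pair first, then fall back to any single
# character outside the BasicLatin block.
$pattern = '[\uD800-\uDBFF][\uDC00-\uDFFF]|\P{IsBasicLatin}'

$escaped = [regex]::Replace($text, $pattern, {
    param($match)
    # ConvertToUtf32 returns the full code point for a pair,
    # or the plain code for a single BMP character.
    '&#{0};' -f [char]::ConvertToUtf32($match.Value, 0)
})

$escaped   # hell&#214; &#128512;
```

In the script above, the same pattern and script block could be dropped into the existing [RegEx]::Replace call in place of the simpler ones.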