4
votes

I'm parsing an XLIFF document using the XDocument class. Does XDocument perform some validation of the content which I read into it, and if so - is there any way to disable that validation?

I'm getting some weird errors if the XLIFF isn't valid XML (I don't care that it isn't, I just want to parse it).

E.g.

'.', hexadecimal value 0x00, is an invalid character. 

I'm currently reading the file like this:

string FileLocation = @"C:\XLIFF\text.xlf";
XDocument doc = XDocument.Load(FileLocation);

Thanks.

4
How do you load the XML into XDocument? From a file? Can you show that line of code? - HABJAN
If it isn't valid, then it isn't XML. How could XDocument work with it? - František Žiačik
@HABJAN - Yes, I'm just loading the content from a file. - Jimmy Collins
@Jimmy C: can I see those couple of lines of code? - HABJAN
@habjan I've added the code I use to read the file. - Jimmy Collins

4 Answers

5
votes

I had a similar problem, which was fixed by letting a StreamReader read the content.

// This line throws an exception like yours
XDocument xd = XDocument.Load(@"C:\test.xml");

// This works (different variable name so both lines compile in one scope)
XDocument xd2 = XDocument.Load(new System.IO.StreamReader(@"C:\test.xml"));

If that does not help, try specifying the proper encoding explicitly.
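
As a minimal sketch of passing the encoding explicitly: the class name `XliffLoader` and its `Load` method are hypothetical, and which encoding is correct for a given file is an assumption you would need to confirm (UTF-16 is only a guess for files that show 0x00 bytes).

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XliffLoader
{
    // Hypothetical helper: loads XML from a file, decoding it with the given encoding.
    public static XDocument Load(string path, Encoding encoding)
    {
        // The third argument lets a byte order mark, if present, override the supplied encoding.
        using (var reader = new StreamReader(path, encoding, true))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Usage would look like `XDocument doc = XliffLoader.Load(@"C:\XLIFF\text.xlf", Encoding.Unicode);` where `Encoding.Unicode` is little-endian UTF-16.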

4
votes

If you want to strip characters from strings that are invalid for use in XML, you can use this method:

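// Requires: using System.Text.RegularExpressions;
private static string RemoveXmlInvalidCharacters(string s)
{
    // \u takes exactly four hex digits in .NET, so characters above U+FFFF
    // are handled as surrogate pairs; only unpaired surrogates are removed.
    return Regex.Replace(
        s,
        @"[^\u0009\u000A\u000D\u0020-\uFFFD]|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])",
        string.Empty);
}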

It removes any characters that fall outside of the set of valid character values, according to the XML standard.
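
To apply this before parsing, one option is to read the file as text, clean it, and hand the result to `XDocument.Parse`. The sketch below is only an illustration: `SanitizedXml` and its `Load` method are hypothetical names, and the cleaning function is passed in as a parameter so the block stands on its own.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public static class SanitizedXml
{
    // Hypothetical helper: reads the file as text, cleans it with the supplied
    // function (for example RemoveXmlInvalidCharacters), then parses the result.
    public static XDocument Load(string path, Func<string, string> sanitize)
    {
        string raw = File.ReadAllText(path);
        return XDocument.Parse(sanitize(raw));
    }
}
```

From inside the class that declares the method above, a call would be `SanitizedXml.Load(path, RemoveXmlInvalidCharacters)`. Note that `File.ReadAllText` assumes UTF-8 when the file has no byte order mark.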

2
votes

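You can't parse malformed XML, because a parser requires a well-formed XML structure.
It may be that the file is being read with the wrong encoding, for example as ASCII when it is actually UTF-8 or UTF-16, and that leads to the problem you encountered. UTF-16 text decoded one byte at a time produces a 0x00 for every other byte of plain ASCII content.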

Possible solution:
Read the file as UTF-8.
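
Before picking an encoding, it can help to look at the first bytes of the file. The following is a rough diagnostic sketch, not a full detector: `EncodingProbe` and `DescribeBom` are hypothetical names, it only recognizes the UTF-8 and UTF-16 byte order marks, and a file without a BOM tells you nothing either way.

```csharp
using System.IO;

public static class EncodingProbe
{
    // Hypothetical diagnostic: names the encoding implied by the file's byte order mark.
    // UTF-32 is deliberately ignored, so a UTF-32 LE file would be reported as UTF-16 LE.
    public static string DescribeBom(string path)
    {
        byte[] head = new byte[3];
        int read;
        using (var stream = File.OpenRead(path))
        {
            read = stream.Read(head, 0, head.Length);
        }
        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF) return "UTF-8";
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE) return "UTF-16 LE";
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF) return "UTF-16 BE";
        return "no BOM";
    }
}
```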

2
votes

An XLIFF document is an XML document. Character 0x00 is not a valid XML character, so a file that contains it is not well-formed XML, and XML parsers cannot read it.

Note that well-formedness and validity are different things: a non-validating parser can read XML that is well-formed but not valid against its DTD or schema, but no conforming parser will read XML that is not well-formed.

Valid characters according to XML Specification:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
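
The ranges above translate directly into a range check. This is a minimal sketch with hypothetical names (`XmlChars`, `IsValidXmlCodePoint`); it takes a full Unicode code point rather than a UTF-16 `char`, so characters above U+FFFF must be combined from their surrogate pair first (for example with `char.ConvertToUtf32`).

```csharp
public static class XmlChars
{
    // Mirrors the character ranges listed above; cp is a Unicode code point,
    // not a UTF-16 code unit.
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```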

UPDATE

Suggested solution: pre-process the files to remove invalid characters. The \0 character can be replaced with a space unless it carries meaning (that is, the content is binary), in which case the data needs to be encoded as Base64.
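
For the binary case, a rough sketch of the Base64 round trip might look like this. The class `BinaryPayload`, its method names, and the element name used in the usage note are all hypothetical; XLIFF itself defines its own elements for embedded binary data, which this sketch does not attempt to follow.

```csharp
using System;
using System.Xml.Linq;

public static class BinaryPayload
{
    // Hypothetical helper: stores raw bytes as Base64 text in an element with the given name.
    public static XElement ToElement(string name, byte[] data)
    {
        return new XElement(name, Convert.ToBase64String(data));
    }

    // Reverses ToElement: decodes the element's text content back into bytes.
    public static byte[] FromElement(XElement element)
    {
        return Convert.FromBase64String(element.Value);
    }
}
```

For example, `BinaryPayload.ToElement("data", new byte[] { 0, 1, 2 })` yields an element whose text is `AAEC`, which contains no characters that an XML parser would reject.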