1
votes

I am using XmlDocument.Load to load the contents of an XML file that has some characters in Thai. The application is erroring out with the following exception.

System.Xml.XmlException: Invalid character in the given encoding. Line 2, position 82.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.InvalidCharRecovery(Int32& bytesCount, Int32& charsCount)
   at System.Xml.XmlTextReaderImpl.GetChars(Int32 maxCharsCount)
   at System.Xml.XmlTextReaderImpl.ReadData()
   at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
   at System.Xml.XmlTextReaderImpl.FinishPartialValue()
   at System.Xml.XmlTextReaderImpl.get_Value()
   at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
   at System.Xml.XmlDocument.Load(XmlReader reader)

The XML file begins with this content: [screenshot not included]

Notice the strange character before the closing tag. This content is coming from a third-party and I don't have access to the file/content.

My questions are:

  1. Why is the strange character appearing in the content sent to me by the third-party provider?
  2. Is there any way to successfully process the file (load it into the XmlDocument), given that I cannot modify its content before processing it?
2
Use XmlReaderSettings.CheckCharacters = false. But better: contact the third party and ask them to fix the issue, because as it stands this does not seem to be valid XML. – Evk
The only useful recommendation is to start checking the Jobs section on SO... If you can't work with that third party to make sure they return valid XML, you are completely stuck, because you can't reconstruct the document correctly (how would you know what else is incorrect in it?). You can search the thousands of existing "read invalid XML" questions; maybe you'll find some inspiration there, such as using HtmlAgilityPack to read the text instead, or manually stripping invalid UTF-8 bytes from the stream. – Alexei Levenkov
Also make sure that this really is the third party's problem and that you are not damaging the file yourself (for example, by reading it with a wrong, non-UTF-8 encoding). – Evk
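
To follow up on the last comment, one way to tell whether the raw bytes are well-formed UTF-8 at all is to decode them with a strict decoder. The following is only a minimal sketch: the class and method names (Utf8Check, IsWellFormedUtf8) are made up for illustration, and it assumes the file is small enough to read into memory in one go.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: returns true when the bytes decode as UTF-8
    // without any invalid sequences.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes: true) makes the decoder throw
        // instead of silently substituting U+FFFD for bad bytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The exception also exposes BytesUnknown and Index for logging.
            return false;
        }
    }
}
```

For example, Utf8Check.IsWellFormedUtf8(File.ReadAllBytes(path)) returning false suggests the file is either in a different encoding (Thai text saved as TIS-620 or Windows-874 typically fails this check) or genuinely corrupted.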

2 Answers

1
votes

If you are sure that the characters are Thai, try specifying the correct encoding when calling Load.

A common character encoding for Thai is ISO 8859-11.

So try loading the document like this:

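 // File.Open requires a FileMode argument; File.OpenRead is the read-only shortcut.
 using (var reader = new StreamReader(File.OpenRead("YourXMLFile.xml"),
                                      Encoding.GetEncoding("iso-8859-11")))
     xmlDoc.Load(reader);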

As for your first question, you will need to talk to the third party and ask them to look into their source code to find out why those unwanted characters are appearing in the generated XML.

0
votes

The data supplied by the third party is not valid XML. I think there are only two solutions: get the third party to supply valid XML, or strip the invalid characters from the XML and process what you can. You could do this...

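string invalidXml = File.ReadAllText(path);
// Keep only characters that XML allows (requires System.Linq and System.Xml).
string validXml = new string(invalidXml.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
if (validXml != invalidXml)
{
    // log that invalid characters were removed
}

// process (what you can in) the validXml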
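
Building on the stripping idea above, here is a slightly fuller sketch that also loads the result into an XmlDocument. It is not a definitive fix: the LenientXml class is hypothetical, it assumes the file is meant to be UTF-8, and the parse can still fail if the damage sits inside markup rather than text. One detail worth knowing is that the default UTF-8 decoder replaces undecodable bytes with U+FFFD, which is itself a legal XML character, so it has to be filtered explicitly.

```csharp
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class LenientXml
{
    // Hypothetical helper: decode as UTF-8, drop characters that cannot be
    // kept, then parse. 'removed' reports how many characters were dropped.
    public static XmlDocument Load(Stream input, out int removed)
    {
        string raw;
        // StreamReader also strips a leading byte order mark, if present.
        using (var reader = new StreamReader(input, Encoding.UTF8))
            raw = reader.ReadToEnd();

        // Drop decoder replacement characters (U+FFFD) and anything that
        // XmlConvert.IsXmlChar rejects.
        string cleaned = new string(
            raw.Where(ch => ch != '\uFFFD' && XmlConvert.IsXmlChar(ch)).ToArray());
        removed = raw.Length - cleaned.Length;

        var doc = new XmlDocument();
        doc.LoadXml(cleaned); // still throws XmlException if the markup is broken
        return doc;
    }
}
```

A caller could use it as LenientXml.Load(File.OpenRead(path), out removed) and log whenever removed is greater than zero, so that silently dropped data does not go unnoticed.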