1
votes

I am parsing a large XML file (~500 MB) that contains the invalid XML character 0x07, so XmlReader throws an invalid-character exception. Our current workaround is to wrap the Stream in a StreamReader, strip the character with Regex.Replace, write the result to memory with a StreamWriter, and feed the cleaned copy back to XmlReader. I would like to avoid that extra pass and skip the offending element directly in XmlReader. Is there any way to do that? The snippet below is my attempt, but it throws the exception at this line: var node = (XElement)XNode.ReadFrom(xr);

    protected override IEnumerable<XElement> StreamReader(Stream stream, string elementName)
    {
        var arrTag = elementName.Split('|').ToList();
        using (var xr = XmlReader.Create(stream, new XmlReaderSettings { CheckCharacters = false }))
        {
            while (xr.Read())
            {
                if (xr.NodeType == XmlNodeType.Element && arrTag.Contains(xr.Name))
                {
                    var node = (XElement)XNode.ReadFrom(xr);
                    node.ReplaceWith(node.Elements().Where(e => e.Name != "DaylightSaveInfo"));
                    yield return node;
                }
            }
            xr.Close();
        }
    }

XML sample; the element containing the invalid character is the attr named DaylightSaveInfo:

<?xml version="1.0" encoding="ISO-8859-1"?>
<LATree>
<LA className="BTT00NE" fdn="NE=9739">
    <attr name="fdn">NE=9739</attr>
    <attr name="IP">10.157.144.100</attr>
    <attr name="realLatitude">0D0&apos;0&quot;S</attr>
    <attr name="realLongitude">0D0&apos;0&quot;W</attr>
    <attr name="DaylightSaveInfo">NO</attr>
</LA>
</LATree>

BEL Character

2
It is not invalid!!! See wiki: en.wikipedia.org/wiki/… - jdweng
@LeonardoHerrera: This isn't about encodings - if the text contains a bell character, that's just invalid XML (1.0). - Jon Skeet
@LeonardoHerrera: I suspect it'll be hard for the OP to include a bell character in a copy/pastable form. Although it's easy enough to construct a string with that in and then show that failing to parse. - Jon Skeet
If you're being provided with invalid XML, I would start by asking wherever you're getting that XML from to give you a valid XML document to start with. If they're supplying invalid data in that aspect, who knows what else is wrong... - Jon Skeet
This code reproduces the issue: var xml = "<?xml version=\"1.0\"?><root>\a</root>"; You need to clean up your stream, there is no way around it. - Leonardo Herrera

2 Answers

2
votes

I just saw that Jon Skeet wrote something about this, so I cannot take credit really, but since his stature on SO is way above mine, I could perhaps gain a point or two for writing it. :)

First I wrote a class that derives from TextReader and overrides Peek and Read so that invalid characters are skipped. (Some reference material in the links.)

https://www.w3.org/TR/xml/#NT-Char

https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/io/textreader.cs

class FilterInvalidXmlReader : System.IO.TextReader
{
  private System.IO.StreamReader _streamReader;

  public System.IO.Stream BaseStream => _streamReader.BaseStream;

  public FilterInvalidXmlReader(System.IO.Stream stream) => _streamReader = new System.IO.StreamReader(stream);

  public override void Close() => _streamReader.Close();

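  protected override void Dispose(bool disposing)
  {
    // Release the wrapped reader only on an explicit Dispose(), then let the base class finish.
    if (disposing)
    {
      _streamReader.Dispose();
    }

    base.Dispose(disposing);
  }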

  public override int Peek()
  {
    var peek = _streamReader.Peek();

    while (IsInvalid(peek, true))
    {
      _streamReader.Read();

      peek = _streamReader.Peek();
    }

    return peek;
  }

  public override int Read()
  {
    var read = _streamReader.Read();

    while (IsInvalid(read, true))
    {
      read = _streamReader.Read();
    }

    return read;
  }


  public static bool IsInvalid(int c, bool invalidateCompatibilityCharacters)
  {
    if (c == -1)
    {
      return false;
    }

    if (invalidateCompatibilityCharacters && ((c >= 0x7F && c <= 0x84) || (c >= 0x86 && c <= 0x9F) || (c >= 0xFDD0 && c <= 0xFDEF)))
    {
      return true;
    }

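    // Valid XML 1.0 characters. Surrogate code units (0xD800-0xDFFF) are let through so that
    // characters above U+FFFF, which arrive as surrogate pairs, are not stripped; an unpaired
    // surrogate is still left for XmlReader to reject.
    if (c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD))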
    {
      return false;
    }

    return true;
  }
}

Then I created a console application and in the main I put:

  using (var memoryStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes("<Test><GoodAttribute>a\u0009b</GoodAttribute><BadAttribute>c\u0007d</BadAttribute></Test>")))
  {
    using (var xmlFilteredTextReader = new FilterInvalidXmlReader(memoryStream))
    {
      using (var xr = System.Xml.XmlReader.Create(xmlFilteredTextReader))
      {
        while (xr.Read())
        {
          if (xr.NodeType == System.Xml.XmlNodeType.Element)
          {
            var xe = System.Xml.Linq.XElement.ReadFrom(xr);

            System.Console.WriteLine(xe.ToString());
          }
        }
      }
    }
  }
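
To tie this back to the question's sample, here is a minimal sketch of the same idea applied to an ISO-8859-1 document. It assumes the input really uses the encoding named in its declaration: once XmlReader is handed a TextReader it no longer sees the bytes, so the StreamReader has to be told the encoding explicitly. The names StrippingReader and Demo are made up for this example, and this trimmed-down variant only drops characters outside the XML 1.0 Char range, not the discouraged compatibility characters.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical, trimmed-down variant of the filter above: drops any UTF-16 code unit
// that falls outside the XML 1.0 Char production (surrogates are passed through).
class StrippingReader : TextReader
{
    private readonly TextReader _inner;

    public StrippingReader(TextReader inner) => _inner = inner;

    private static bool IsAllowed(int c) =>
        c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);

    public override int Peek()
    {
        int c = _inner.Peek();
        while (c != -1 && !IsAllowed(c))
        {
            _inner.Read();          // consume the invalid character
            c = _inner.Peek();
        }
        return c;
    }

    public override int Read()
    {
        int c = _inner.Read();
        while (c != -1 && !IsAllowed(c))
        {
            c = _inner.Read();      // skip ahead to the next valid character
        }
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

static class Demo
{
    static void Main()
    {
        // Shortened version of the question's sample, with a BEL (\a) placed before "NO".
        string xml =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" +
            "<LATree><LA className=\"BTT00NE\" fdn=\"NE=9739\">" +
            "<attr name=\"DaylightSaveInfo\">\aNO</attr>" +
            "</LA></LATree>";

        // Assumption: the file is ISO-8859-1, as its XML declaration claims.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        using (var stream = new MemoryStream(latin1.GetBytes(xml)))
        using (var filtered = new StrippingReader(new StreamReader(stream, latin1)))
        using (var xr = XmlReader.Create(filtered))
        {
            XDocument doc = XDocument.Load(xr);
            // The BEL has been removed, so only the text "NO" remains in the element.
            Console.WriteLine(doc.Root.Element("LA").Element("attr").Value);
        }
    }
}
```

Because the filter works one character at a time it never buffers the whole file, so it should fit the streaming loop from the question as well; only the XmlReader.Create call would need to take the filtered reader instead of the raw stream.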

Hopefully this helps, or at least provides a starting point.

-2
votes

The following LINQ to XML code runs without errors. In the XML file I used the following "NO" value:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace ConsoleApplication108
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
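            XmlReaderSettings settings = new XmlReaderSettings();
            // Note: this only relaxes checking of character entity references;
            // raw text content is still validated by XmlReader.
            settings.CheckCharacters = false;

            XDocument doc;
            // Dispose the reader once the document has been loaded.
            using (XmlReader reader = XmlReader.Create(FILENAME, settings))
            {
                doc = XDocument.Load(reader);
            }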

            Dictionary<string, string> dict = doc.Descendants("attr")
                .GroupBy(x => (string)x.Attribute("name"), y => (string)y)
                .ToDictionary(x => x.Key, y => y.FirstOrDefault());

        }
    }

}