0
votes

Issue description:

  • I need to fix an issue with resolving standard HTML entities.
  • I've implemented HtmlEntityReader, an XmlReader implementation that contains code to resolve entities.
  • The public API of our system exposes methods that take an XmlReader, so a user can pass in an XmlReader created with any of the XmlReader.Create methods.

The current code of my XML unit tests is below:

using System.Xml;
using NUnit.Framework;

namespace Tests
{
    [TestFixture]
    public class XmlTests
    {
        // this test works
        [Test]
        public void TestEntitiesResolving1()
        {
            var path = QA.ResolvePath(@"html\bugs\317.html");
            using (var reader = new XmlTextReader(path, new NameTable()))
            {
                reader.XmlResolver = null; //to prevent DTD downloading
                var wrapper = new HtmlEntityReader(reader, XmlUtils.HtmlEntities);
                while (wrapper.Read()) { }
            }
        }

        // this test does not work - why?
        // what's the difference in initialization of internal XmlTextReaderImpl?
        [Test]
        public void TestEntitiesResolving2()
        {
            var path = QA.ResolvePath(@"html\bugs\317.html");
            var settings = new XmlReaderSettings
                           {
                               XmlResolver = null, //to prevent DTD downloading
                               NameTable = new NameTable(),
                               ProhibitDtd = false,
                               CheckCharacters = false,
                           };
            using (var reader = XmlReader.Create(path, settings))
            {
                var wrapper = new HtmlEntityReader(reader, XmlUtils.HtmlEntities);
                while (wrapper.Read()) { }
            }
        }
    }
}

Partial code of HtmlEntityReader is below:

internal sealed class HtmlEntityReader : XmlReader
{
    readonly XmlReader _impl;
    readonly Hashtable _entitySet;
    string _entityValue;

    public HtmlEntityReader(XmlReader reader, Hashtable entitySet)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (entitySet == null) throw new ArgumentNullException("entitySet");
        _impl = reader;
        _entitySet = entitySet;
    }

    public override XmlNodeType NodeType
    {
        get { return _entityValue != null ? XmlNodeType.Text : _impl.NodeType; }
    }

    public override string LocalName
    {
        get { return _entityValue != null ? string.Empty : _impl.LocalName; }
    }

    public override string Prefix
    {
        get { return _entityValue != null ? string.Empty : _impl.Prefix; }
    }

    public override string Name
    {
        get { return _entityValue != null ? string.Empty : _impl.Name; }
    }

    public override bool HasValue
    {
        get { return _entityValue != null || _impl.HasValue; }
    }

    public override string Value
    {
        get { return _entityValue ?? _impl.Value; }
    }

    public override bool CanResolveEntity
    {
        get { return true; }
    }

    public override void ResolveEntity()
    {
        // it seems this is never called - why?
    }

    public override bool Read()
    {
        _entityValue = null;
        if (!_impl.Read()) return false;
        if (NodeType == XmlNodeType.EntityReference)
        {
            // resolve the entity reference from the supplied entity table
            _entityValue = (string)_entitySet[Name];
        }
        return true;
    }

    // ... delegation of XmlReader abstract methods to _impl
}

I get the following exception:

System.Xml.XmlException: Reference to undeclared entity 'nbsp'. Line 4, position 5.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.Throw(String res, String arg, Int32 lineNo, Int32 linePos)
at System.Xml.XmlTextReaderImpl.HandleGeneralEntityReference(String name, Boolean isInAttributeValue, Boolean pushFakeEntityIfNullResolver, Int32 entityStartLinePos)
at System.Xml.XmlTextReaderImpl.HandleEntityReference(Boolean isInAttributeValue, EntityExpandType expandType, ref Int32 charRefEndPos)
at System.Xml.XmlTextReaderImpl.ParseText(ref Int32 startPos, ref Int32 endPos, ref Int32 outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlTextReaderImpl.Read()
... private stuff

Could you offer some quick advice or a link to a solution while I keep investigating this issue myself?

I am unsure what you mean by "doesn't work". It appears to me that both of those tests should pass - and if one doesn't, it's because an exception is being thrown somewhere. If you're getting an exception, you really should tell us what it is. - Anon.
Agreed with first commenter. Copy the exception and put it here. - Jon Limjap
I don't know if it's the problem specifically, but consider writing your own UrlResolver class for resolving the entities. I've seen such classes get called. - John Saunders
@John Saunders thanks, we'll try. - sergeyt
FYI: XmlResolver/XmlUrlResolver only gets called for external entities/resources, not for character entities. - Scott Willeke

2 Answers

1
votes

I've done some research on your question and, as best I can tell, the only way to ensure that character entities are resolved is to declare them in a DTD. You can resolve the DTD contents yourself (e.g. for caching) by deriving an implementation from the System.Xml.XmlResolver base class and responding to GetEntity calls with a stream containing the DTD data.
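
As a rough illustration of the resolver approach, here is a minimal sketch. The names InMemoryDtdResolver and ResolverDemo and the two-entity DTD string are assumptions made up for this example, and it uses the .NET 4 DtdProcessing setting (on older frameworks ProhibitDtd = false plays that role). It only helps when the document carries a DOCTYPE that points at an external DTD.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;

// Hypothetical resolver: answers every external DTD request with a small
// in-memory DTD instead of going to the network.
public sealed class InMemoryDtdResolver : XmlResolver
{
    // Assumption: only the entities the documents actually use are declared.
    private const string Dtd =
        "<!ENTITY nbsp \"&#160;\"><!ENTITY copy \"&#169;\">";

    // Abstract on .NET Framework; no credentials are needed for in-memory data.
    public override ICredentials Credentials { set { } }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Return the same DTD text no matter which URI was requested.
        return new MemoryStream(Encoding.UTF8.GetBytes(Dtd));
    }
}

public static class ResolverDemo
{
    // Reads the document and concatenates all text nodes.
    public static string ReadAllText(string xml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // ProhibitDtd = false before .NET 4
            XmlResolver = new InMemoryDtdResolver()
        };
        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text) text.Append(reader.Value);
            }
        }
        return text.ToString();
    }
}
```

For instance, ResolverDemo.ReadAllText("<!DOCTYPE html SYSTEM \"http://example.invalid/x.dtd\"><html><p>a&nbsp;b</p></html>") should hand back the text with the entity already expanded, without any network access.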

I wrote an article some time back that explains how to push a default DTD onto the XmlParserContext if there is no DTD declared in your input document. This article is a little dated, but the same concept continues to work with XmlReaderSettings & XmlReader.Create by using an XmlReader.Create overload that accepts an XmlParserContext object as an argument.
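
A minimal sketch of that idea, assuming the input has no DOCTYPE of its own: the class name DefaultDtdDemo and the one-entity internal subset are made up for illustration, and DtdProcessing is the .NET 4 spelling of the older ProhibitDtd = false.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class DefaultDtdDemo
{
    // Reads a document that has no DOCTYPE and concatenates its text nodes.
    public static string ReadAllText(string xmlWithoutDoctype)
    {
        var nameTable = new NameTable();

        // Assumption: a tiny internal subset declaring only the entities needed.
        const string internalSubset = "<!ENTITY nbsp \"&#160;\">";

        var context = new XmlParserContext(
            nameTable,
            new XmlNamespaceManager(nameTable),
            "html",        // docTypeName
            string.Empty,  // pubId: none, so nothing external is fetched
            string.Empty,  // sysId: none
            internalSubset,
            string.Empty,  // baseURI
            string.Empty,  // xmlLang
            XmlSpace.None);

        var settings = new XmlReaderSettings
        {
            NameTable = nameTable,               // same table as the context
            DtdProcessing = DtdProcessing.Parse, // ProhibitDtd = false before .NET 4
            XmlResolver = null
        };

        var text = new StringBuilder();
        using (var reader = XmlReader.Create(
            new StringReader(xmlWithoutDoctype), settings, context))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text) text.Append(reader.Value);
            }
        }
        return text.ToString();
    }
}
```

Calling DefaultDtdDemo.ReadAllText("<html><p>a&nbsp;b</p></html>") should then read the paragraph without the "undeclared entity" exception. A real entity set would need one declaration per HTML entity you expect to meet.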

Finally, it also looks like .NET 4 will help us out a little with a new XmlResolver derivative named XmlPreloadedResolver which seems to have the XHTML1 and RSS DTDs built in.
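
A minimal sketch of how that could look on .NET 4 or later, assuming the document declares one of the standard XHTML 1.0 DOCTYPEs; the class name PreloadedDtdDemo is made up for this example.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Resolvers;

public static class PreloadedDtdDemo
{
    // Reads an XHTML 1.0 document and concatenates its text nodes.
    public static string ReadAllText(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            // Serves the XHTML 1.0 DTDs and entity sets from memory.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text) text.Append(reader.Value);
            }
        }
        return text.ToString();
    }
}
```

The input would need a DOCTYPE such as <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">; documents without a recognized DOCTYPE are not helped by this resolver.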

0
votes

The funny thing is that, as sergeyt noted, XmlTextReader doesn't care about undefined entities when processing an XML fragment, while XmlReader does!

So in many cases a solution would be to try an XmlTextReader.