2
votes

I am using the Html Agility pack to select out textual data from within rss xml. For every other node type (title, pubdate, guid .etc) I can select out the inner-text using XPath conventions however when querying "//link" or indeed "item/link" empty strings are returned.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

Does anybody have an understanding of why this element behaves differently to the others?

Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes however the inner text is zero chars in length.

Using the below test data, with "//name" returns a single record for "fred" however with "//link" a single record with an empty string is returned.

<site><link>Hello World</link><name>Fred</name></site>

I am certain its because of the world "link". If I change it to "linkz" it works perfectly.

The below workaround works perfectly. However I would like to understand why searching on "//link" does not work as other elements do.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
1

1 Answers

4
votes

If you print the DocumentNode.OuterHtml, you will see the problem :

var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);

output :

<site><link>Hello World<name>Fred</name></site>

link happen to be one of some special tags* that is treated as self-closing tag by HAP. You can alter this behavior by setting ElementsFlags before parsing the HTML, for example :

var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link");  //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}

Dotnetfiddle Demo

output :

<site><link>Hello World</link><name>Fred</name></site>
Hello World

*) Complete list of the special tags besides link, that included in the ElementsFlags dictionary by default, can be seen in the source code of HtmlNode.cs. Some of the most popular among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.