3
votes

I am having trouble getting some specific table's with HTML Agility Pack. I cannot change the actual HTML either, so I can't use other ID"s or Classes or anything.

Can someone show me how I would access each individual table of the following?

<table class="newTable">
      //table 1 contents
    <table border="0" cellpadding="3" cellspacing="2" width="100%">
         //table 1 - A contents
    </table>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="newTable">
     //table 2 contents
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
        //table 2 - A contents
    </table>
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
       //table 2 - B contents
    </table>
    <table width="100%" cellspacing="2" cellpadding="0">
       //table 2 - C contents
    </table>
</table>
<table>
     //table 3 contents
</table>

Right now if I were to call the following

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
foreach (var cell in table.SelectNodes("//tr/td"))
{
     string someVariable = cell.InnerText
}

I would go through everything. I want to be able to access tables differently to correlate where I am storing the data.

I have tried looking at something like

doc.DocumentNode.SelectNodes("//table[1]");

but using an index does not seem to work, when I try to specify a table with it, it still reads in all tables or none.

Same thing applies to this, it either does not work at all or gets everything.

foreach (var cell in table.SelectNodes("//table").Skip(some_number))
{
     string someVariable = cell.InnerText
}

I am using the NuGet package of HTML Agility Pack 1.4.9

EDIT:

My attempt to get ONLY Table 1 - A's contents. Both give null or endcodingfound exceptions.

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table/tr/td/table[1]");

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]/tr/td/table[1]");

1

1 Answers

8
votes

The error is with your second call, the "//tr/td" will go back to the root element. Your indexer is the correct solution for the first part of your problem, the second can be fixed by specifying that you want to navigate from where you are at:

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td")) // **notice the .**
{
     string someVariable = cell.InnerText
}

Not sure what else is going on, but by extending your test table to this code, the following just works on my test. It might mean that you need to share a little more context.

This is the Document I used for the tests:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <table class="newTable">
        <tr>
            <td>
                <table border="0" cellpadding="3" cellspacing="2" width="100%">
                    <tr><td>
                        //table 1 - A contents
                    </td></tr>
                </table>
            </td>
        </tr>

    </table>
    <table border="0" cellpadding="0" cellspacing="0" class="newTable">
        <tr>
            <td>
                //table 2 contents
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - A contents
                        </td>
                    </tr>
                </table>
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - B contents
                        </td>
                    </tr>
                </table>
                <table width="100%" cellspacing="2" cellpadding="0">
                    <tr>
                        <td>
                            //table 2 - C contents
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
    <table>
        <tr>
            <td>
                //table 3 contents
            </td>
        </tr>
    </table>
</body>
</html>

And this the code to extract the values you're after:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

var node1A = doc.DocumentNode.SelectSingleNode("//table[1]//table[1]");
string content1A = node1A.InnerText;
Console.WriteLine(content1A);

var node2C = doc.DocumentNode.SelectSingleNode("//table[2]//table[3]");
string content2C = node2C.InnerText;
Console.WriteLine(content2C);

Shows:

enter image description here

Update

Ok, I took your actual HTML and I get a NullReference as well. There must be something that greatly confuses the Agility Pack, not sure why. Some experimentation with the Linq API seems to work though, I hope it can be an alternative for you:

var table = doc.DocumentNode.DescendantsAndSelf("table").Skip(1).First().Descendants("table").First();
var tds   = table.Descendants("td");