3
votes

I'm using Html Agility Pack on a website to extract some data. Parsing some of the HTML I need is easy but I am having trouble with this (slightly complex?) piece of HTML.

<tr>
  <td>
    <div onmouseover="toggle('clue_J_1_1', 'clue_J_1_1_stuck', '<em class=&quot;correct_response&quot;>Obama</em><br /><br /><table width=&quot;100%&quot;><tr><td class=&quot;right&quot;>Kailyn</td></tr></table>')" onmouseout="toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')" onclick="togglestick('clue_J_1_1_stuck')">
... 

I need to get the value from the em class "correct_response" in the onmouseover div based on the clue_J_X_Y value. I really don't know how to go beyond this..

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr//td/div[@onmouseover]");

Some help would be appreciated.

1

1 Answers

1
votes

I don't know what you're supposed to get out from the em. But I will give you all the data you say you need to figure it out.

First we load the HTML.

    string html = "<tr>" +
        "<td>" +
        "<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '<em class=&quot;correct_response&quot;>Obama</em><br/><br/><table width=&quot;100%&quot;><tr><td class=&quot;right&quot;>Kailyn</td></tr></table>')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
    HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    //Console.WriteLine(doc.DocumentNode.OuterHtml);

Then we get the value of the attribute, onmouseover.

        string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");

It will return FAILED if it failed to find an attribute named "onmouseover". Now we get the parameters of the toggle method where each are enclosed by two '(apostrophe).

//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false; string temp = "";
for(int i=0; i<toggle.Length; i++)
{
    if (toggle[i] == '\'' && flag== true)
    {
        toggleVariables.Add(temp);
        temp = "";
        flag = false;
    }
    else if (flag)
    {
        temp += toggle[i];
    }
    else if (toggle[i] == '\'')
    {
        flag = true;
    }
}

After that we have a list with 3 entities. In this case it will contain the following.

  • clue_J_1_1
  • clue_J_1_1_stuck
  • <em class="correct_response">Obama</em><br/><br/><table width="100%"><tr><td class="right">Kailyn</td></tr></table>;

Now we can create a new HtmlDocument with the HTML code from the third parameter. But first we have to convert it into workable HTML since the third parameter contains escape characters from HTML.

        //Make it into workable HTML
        toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);

        //New HtmlDocument
        HtmlDocument htmlInsideToggle = new HtmlDocument();
        htmlInsideToggle.LoadHtml(toggleVariables[2]);

        Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);

And done. The code in it's entirety is below from here.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;
using System.Web;

namespace test
{
    class Program
    {

    public static void Main(string[] args)
    { 
            string html = "<tr>" +
                "<td>" +
                "<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '<em class=&quot;correct_response&quot;>Obama</em><br/><br/><table width=&quot;100%&quot;><tr><td class=&quot;right&quot;>Kailyn</td></tr></table>')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            //Console.WriteLine(doc.DocumentNode.OuterHtml);

            string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
            //Clean up string

            //Console.WriteLine(toggle);

            //Get Variables from toggle()
            List<string> toggleVariables = new List<string>();
            bool flag = false; string temp = "";
            for(int i=0; i<toggle.Length; i++)
            {
                if (toggle[i] == '\'' && flag== true)
                {
                    toggleVariables.Add(temp);
                    temp = "";
                    flag = false;
                }
                else if (flag)
                {
                    temp += toggle[i];
                }
                else if (toggle[i] == '\'')
                {
                    flag = true;
                }
            }

            //Make it into workable HTML
            toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
            //New HtmlDocument
            HtmlDocument htmlInsideToggle = new HtmlDocument();
            htmlInsideToggle.LoadHtml(toggleVariables[2]);

            Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);

            //You're on your own from here                

            Console.ReadKey();

    }
}