1
votes

I am attempting to use the HTML Agility Pack to look up specific keywords on Google, then check through linked nodes until it find my websites string url, then parse the innerHTML of the node I am on for my Google ranking.

I am relatively new to the Agility Pack (as in, I started really looking through it yesterday) so I was hoping I could get some help on it. When I do the search below, I get Failures on my Xpath queries every time. Even if I insert something as simple as SelectNodes("//*[@id='rso']"). Is this something I am doing incorrectly?

    private void GoogleScrape(string url)
    {
        string[] keys = keywordBox.Text.Split(',');
        for (int i = 0; i < keys.Count(); i++)
        {
            var raw = "http://www.google.com/search?num=100&q=";
            string search = raw + HttpUtility.UrlEncode(keys[i]);
            var webGet = new HtmlWeb();
            var document = webGet.Load(search);
            loadtimeBox.Text = webGet.RequestDuration.ToString();

            var ranking = document.DocumentNode.SelectNodes("//*[@id='rso']");

            if (ranking != null)
            {
                googleBox.Text = "Something";
            }
            else
            {
                googleBox.Text = "Fail";
            }
           }
          }
1
What type of "failure" do you get?alexn
var ranking always comes back as null when looking for things under the .//*[@id='rso'] tag, which is what all their search results that dont have multiple results in the page returns. .//*[@id='resultStats'] returns "Something", but the exact equivalent in another tag returns nothingDanejir
Also, I can use Regex expressions to find the same "nodes", so I know they are showing as there and should be available to find in the Xpath direction, its just not returning resultsDanejir

1 Answers

2
votes

It's not the Agility pack's guilt -- it is tricky google's. If you inspect _text property of HtmlDocument with debugger, you'll find that <ol> that has id='rso' when you inspect it in a browser do not have any attributes for some reason.

I think, in this case you can just serach by "//ol", because there is only one <ol> tag in the google's result page at the moment...

UPDATE: I've done further checks. For example when I do this:

using (StreamReader sr = 
        new StreamReader(HttpWebRequest
          .Create("http://www.google.com/search?num=100&q=test")
          .GetResponse()
          .GetResponseStream()))
{
    string s = sr.ReadToEnd();
    var m2 = Regex.Matches(s, "\\sid=('[^']+'|\"[^\"]+\")");
    foreach (var x in m2)
        Console.WriteLine(x);
}

The only ids that are returned are: "sflas", "hidden_modes" and "tbpr_12".

To conclude: I've used Html Agility Pack and it's coped pretty well even with malformed html (unclosed <p> and even <li> tags etc.).