I want to scrape each href from a YouTube playlist page using HTML Agility Pack. This is my code so far:

    var html = new HtmlDocument();
    html.LoadHtml(new WebClient().DownloadString("https://www.youtube.com/playlist?list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC"));
    var root = html.DocumentNode;
    var p = root.Descendants()
        .Where(n => n.GetAttributeValue("class", "").Equals("pl-video-title"))
        .FirstOrDefault()
        .Descendants("a").Select(node => node.GetAttributeValue("href", ""))
        .FirstOrDefault();

    var points = ("https://youtube.com/embed/" + (Regex.Replace(p, "list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1", "").Trim()));

This code works, but the links sit in a table and I only receive the first href. I don't know how to scrape each href in the table separately (there are about 10 of them). This is the selector I want to scrape from:

#pl-load-more-destination > tr:nth-child(1) > td.pl-video-title

When I use this in place of "pl-video-title", I get an error.

I've been looking at XPath, but I can't get it to work.

You should take a look here, as it picks up every link: dotnetperls.com/scraping-html - KhawajaAtteeq

1 Answer

Assuming you want the href of the playlist links/videos, you can get them with the following:

(Note that I use the ScrapySharp NuGet library along with HtmlAgilityPack to get CSS selector support through the CssSelect extension method; this requires adding using ScrapySharp.Extensions.)

    HtmlWeb w = new HtmlWeb();
    var htmlDoc = w.Load("https://www.youtube.com/playlist?list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC");

    var links = htmlDoc.DocumentNode.CssSelect(".pl-video-title-link");
    foreach (var link in links)
        Console.WriteLine(link.GetAttributeValue("href"));

The output looks like

/watch?v=9bZkp7q19f0&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1

where the index parameter changes according to the position of the link in the list.

Don't forget to prepend https://www.youtube.com to the link if you plan to use it for further scraping: as it stands it is a relative URL, so it is not a valid absolute URI outside the site.
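Since the question mentions XPath, here is a minimal sketch of the same selection using only HtmlAgilityPack, without ScrapySharp. It assumes the page still uses the td.pl-video-title markup from the question. The PlaylistLinks class, the Extract method, and the sample HTML are hypothetical and only there to make the example run without a network call. It also turns each relative href into an absolute URL with System.Uri.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PlaylistLinks
{
    // Returns an absolute URL for every link found inside a td whose class
    // contains "pl-video-title" (markup assumed from the question).
    public static List<string> Extract(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes(
            "//td[contains(@class, 'pl-video-title')]//a[@href]");
        if (anchors == null)
            return new List<string>();

        return anchors
            // Decode &amp; and similar entities in the raw attribute value.
            .Select(a => HtmlEntity.DeEntitize(a.GetAttributeValue("href", "")))
            // Resolve the relative href against the site root.
            .Select(href => new Uri(baseUri, href).AbsoluteUri)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup modelled on the selector in the question.
        const string sample = @"
<table><tbody id=""pl-load-more-destination"">
  <tr><td class=""pl-video-title"">
    <a class=""pl-video-title-link"" href=""/watch?v=9bZkp7q19f0&amp;index=1"">First</a>
  </td></tr>
  <tr><td class=""pl-video-title"">
    <a class=""pl-video-title-link"" href=""/watch?v=bbEoRnaOIbs&amp;index=2"">Second</a>
  </td></tr>
</tbody></table>";

        foreach (var url in Extract(sample, new Uri("https://www.youtube.com")))
            Console.WriteLine(url);
    }
}
```

The null check matters because SelectNodes does not return an empty list when the XPath matches nothing, which is a common source of NullReferenceException when the page layout changes.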

Update:

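To remove a given key from the URL query string, here's a simple way to do it. Note that ParseQueryString expects only the query part, so the URL goes through a UriBuilder first:

    string url = "http://www.youtube.com/watch?v=bbEoRnaOIbs&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=100";

    // Parse only the query part; passing the whole URL would fold the path into the first key.
    var builder = new UriBuilder(url);
    var parsedQs = HttpUtility.ParseQueryString(builder.Query);
    parsedQs.Remove("index");
    builder.Query = parsedQs.ToString();

    Console.WriteLine(builder.Uri);

The URL will then look like

http://www.youtube.com/watch?v=bbEoRnaOIbs&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC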
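Because the question ultimately builds an embed URL, it may be simpler to read the v parameter directly than to strip list and index from the link. The following is a rough sketch under that assumption. The EmbedUrl class and FromWatchLink method are hypothetical names, and the embed prefix is the one used in the question. On .NET Framework it needs a reference to the System.Web assembly.

```csharp
using System;
using System.Web;

public static class EmbedUrl
{
    // Builds an embed URL from a relative or absolute watch link.
    // Returns null when the link carries no "v" parameter.
    public static string FromWatchLink(string href)
    {
        // Resolving against the site root makes relative links parseable.
        var uri = new Uri(new Uri("https://www.youtube.com"), href);

        // ParseQueryString accepts the query part, with or without the leading '?'.
        string videoId = HttpUtility.ParseQueryString(uri.Query)["v"];

        return string.IsNullOrEmpty(videoId)
            ? null
            : "https://youtube.com/embed/" + videoId;
    }

    public static void Main()
    {
        Console.WriteLine(FromWatchLink(
            "/watch?v=9bZkp7q19f0&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1"));
        // prints https://youtube.com/embed/9bZkp7q19f0
    }
}
```

This avoids hard-coding the playlist id and index in a Regex.Replace call, so the same method works for every row of the table.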