1 vote

I'm trying to use BeautifulSoup to extract the name of the channel creator, along with the link to their channel, from a YouTube video page.

Here I have the inspector showing the exact line I want to scrape (screenshot omitted): an a tag with the class yt-simple-endpoint style-scope yt-formatted-string.

I've tried using the class_ keyword argument, but I get [] as a result. What should I do? Do I need to go through the parent div tag and then "go down", as they say in BeautifulSoup? How should I go about soup.find for that particular a tag and class?

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.youtube.com/watch?v=hHW1oY26kxQ")
soup = BeautifulSoup(response.text, "html.parser")

# This works:
videotitle = soup.find("meta", {"property": "og:title"})["content"]
# This returns []:
videochannel = soup.body.find_all("a", class_="yt-simple-endpoint style-scope yt-formatted-string")
Comments:

Why don't you use Selenium to do this? – jizhihaoSAMA

You need to use Selenium. You can tell why if you disable JavaScript on YouTube. – awakenedhaki

Noobie here. Just learned Selenium existed. Could you elaborate? – cruiz-wa

2 Answers

1 vote

Ok so first off, you do not need Selenium. It's very rare that you ever need Selenium, even with javascript/ajax calls. If you ever get that deep into ajax calls, you just need to GET/POST XSRF token keys back and forth until you get the data you want. Selenium is really heavy, bloated, and slow compared to simple HTTP calls via requests. Avoid it when you can. If you're completely stuck and don't know how to navigate ajax post/request tokens, then by all means use it. Better something than nothing.
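For what it's worth, a token handshake with plain requests usually looks something like the sketch below. This is purely illustrative: the URLs, the cookie name, and the header name are hypothetical placeholders for whatever the site you're scraping actually uses.

import requests

session = requests.Session()  # a Session keeps cookies between calls

# Hypothetical endpoints and names, for illustration only.
session.get("https://example.com/login")  # server sets the token cookie

# Many sites hand the token back as a cookie; others embed it in a
# hidden form input you'd pull out with BeautifulSoup or a regex.
token = session.cookies.get("XSRF-TOKEN", "")

# Echo the token back on the POST so the server accepts the request.
response = session.post(
    "https://example.com/api/data",
    data={"query": "something"},
    headers={"X-XSRF-TOKEN": token},
)
print(response.status_code)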

Now, the reason you're not getting the desired response is that what your browser and the python requests package see are two completely different responses. So right from the start, you can't even navigate to where you're going, because you're looking at the wrong map: the browser has its own custom map, and the requests package gets an entirely different map. That's where the pprint module comes in very handy (see the workflow below). pprint helps you see the response you get back more clearly by formatting the text in a cleaner structure.
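If you want to see the difference for yourself, here's a quick sketch (it just pretty-prints a slice of the raw response so you can compare it with what the browser inspector shows you):

import pprint as pp
import requests

response = requests.get("https://www.youtube.com/watch?v=hHW1oY26kxQ")

# Pretty-print the first chunk of the raw HTML that requests receives;
# notice it is not the same markup the browser inspector displays.
pp.pprint(response.text[:2000])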

Lastly, I use Jupyter Notebook from Anaconda because it allows me to work on chunks of the code at a time without having to run the whole program. If you're not already using Jupyter Notebooks, I suggest you give it a go. It will help you see how everything works, with portions of your output "frozen in time".

Best of luck! Hope you weren't too discouraged. This all takes time.

Here is the workflow I used to solve your problem (the original answer included two screenshots of the Jupyter session, omitted here):

from bs4 import BeautifulSoup
import requests
import pprint as pp

# Any desktop browser user-agent string works here.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

url = "https://www.youtube.com/watch?v=hHW1oY26kxQ"

response = requests.get(url, headers={"User-Agent": USER_AGENT})
soup = BeautifulSoup(response.text, "lxml")

# The no-JavaScript page that requests receives puts the channel link
# inside the div with id "watch7-user-header".
for div in soup.find_all("div", {"id": "watch7-user-header"}):
    for a in div.find_all("a"):
        print(a["href"])
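If you also want the channel name, the same anchor tags carry it as text. Here's a small sketch on top of the loop above (the exact markup of the no-JavaScript page can change, so treat the empty-text check as a heuristic):

# Channel name and link from the same anchors; the avatar link
# has no text, so skip anchors whose text is empty.
for div in soup.find_all("div", {"id": "watch7-user-header"}):
    for a in div.find_all("a"):
        name = a.get_text(strip=True)
        if name:
            print(name, a["href"])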
1 vote

You could use Selenium to open a browser, give it a URL, and locate elements using CSS selectors. Here's some starter code that locates the element you're looking for:

from selenium import webdriver
import time

# Opens a Chrome browser window; point executable_path at your
# chromedriver binary.
browser = webdriver.Chrome(executable_path="/PATH/TO/CHROMEDRIVER")

# Navigate to the video page.
browser.get("https://www.youtube.com/watch?v=hHW1oY26kxQ")

# Give the JavaScript-rendered page a moment to load.
time.sleep(5)

# Locate the element using a CSS selector.
chilledCowElem = browser.find_element_by_css_selector("div.ytd-channel-name a")

# Print the name of the channel and its href value.
print(chilledCowElem.text)
print(chilledCowElem.get_attribute("href"))

time.sleep(5)
browser.quit()

Output: the channel name followed by its link (screenshot omitted).

You have to plug in the path to a driver in the webdriver.Chrome(...) call. I'm using Chrome's driver, which you can download here: https://sites.google.com/a/chromium.org/chromedriver/downloads. Here are the Selenium docs if you want to find out more about how to set it up for your project and use it: https://selenium-python.readthedocs.io/installation.html#drivers.
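Note that newer Selenium releases (4.x) removed the find_element_by_* helpers, so the snippet above may raise an error on a current install. The equivalent with the newer API looks roughly like this (a sketch; it assumes Selenium 4.6+, which can usually locate chromedriver on its own):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()  # Selenium 4.6+ finds the driver itself
browser.get("https://www.youtube.com/watch?v=hHW1oY26kxQ")

time.sleep(5)  # crude wait for the JavaScript-rendered page

elem = browser.find_element(By.CSS_SELECTOR, "div.ytd-channel-name a")
print(elem.text)
print(elem.get_attribute("href"))

browser.quit()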