1
votes

I am trying to download all the episodes of Detective Conan from https://www.kiss-anime.ws/ (kudos to them). While scraping the download URLs from the website, I am facing an issue.

Let's say I want to download the first episode of Detective Conan, so I use this URL (https://www.kiss-anime.ws/Anime-detective-conan-1) to scrape the download URL from. When I try to get the HTML of the page to extract the download URL, using the following code:

from urllib.request import Request, urlopen

req = Request('https://www.kiss-anime.ws/Anime-detective-conan-1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

I get the following error:

Traceback (most recent call last):
  File "refine.py", line 41, in <module>
    webpage = urlopen(req).read()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable
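A 503 can be a transient server-side condition rather than bot detection, so before switching tools it can be worth retrying the request a few times. The sketch below is my own illustration, not code from the question: the `fetch_with_retries` helper and its parameters are made up, and the `opener` argument exists only so the retry logic can be exercised without a network.

```python
import time
import urllib.error
from urllib.request import Request, urlopen


def fetch_with_retries(url, attempts=3, delay=2.0, opener=urlopen):
    """Fetch a URL, retrying transient HTTP 5xx errors with a growing delay.

    `opener` defaults to urllib's urlopen; it is injectable so the retry
    logic can be tested offline with a fake opener.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
            return opener(req).read()
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise  # 4xx errors will not fix themselves; fail fast
            last_error = err
            time.sleep(delay * (attempt + 1))  # linear backoff between tries
    raise last_error
```

If the 503 persists across every retry, the server is most likely rejecting the client deliberately, which is where the Selenium approach in the answer comes in.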

I don't want to go to every link and click the download button manually, as there are over 900 episodes. Once I have a link, I will download the episode using the following code (in case anyone is wondering how I would do that):

import webbrowser
webbrowser.open("https://www.kiss-anime.ws/download.php?id=VkE3eFcvTlpZb0RvKzJ0Tmx2V2ROa3J4UWJ1U09Ic0VValh1WGNtY2Fvbz0=&key=B2X2kHBdIGdzAxn4kHmhXDq0XNq5XNu1WtujWq==&ts=1584489495")

Any help would be much appreciated, thank you!

1
Are you limited to using only built-in Python modules, or can you use third-party modules like requests too? – Daweo
@Daweo When I use the requests module, the site detects that I am a bot via a hidden input, so I am not able to scrape the data. Do you have any other solution in mind? – Samyak Jain
I suggest taking a look at Selenium; by using it, you look like a human user to the webpage. – Daweo

1 Answer

0
votes

So, apparently there are 808 episodes. Have a look at this code; there is a lot going on here, but it's simple to understand. I tested the download for around 5-6 episodes and it works...

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from http.client import RemoteDisconnected
import time


def get_browser():
    chrome_options = Options()
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument('--disable-notifications')
    chrome_options.add_argument('--incognito')
    driver = webdriver.Chrome(options=chrome_options)
    return driver


driver = get_browser()
page_url = "https://www.kiss-anime.ws/Anime-detective-conan-1"

try:
    driver.set_page_load_timeout(40)
    driver.get(page_url)
except TimeoutException:
    raise Exception(f"\t{page_url} - Timed out receiving message from renderer")
except RemoteDisconnected:
    # RemoteDisconnected means the server closed the connection, not a 404
    raise Exception(f"\t{page_url} - Remote end closed the connection without a response.")

WebDriverWait(driver, timeout=40).until(EC.presence_of_element_located((By.ID, "selectEpisode")))
driver.find_element(By.ID, "selectEpisode").click()  # find_element_by_id was removed in Selenium 4
soup = BeautifulSoup(driver.page_source, "html.parser")

options = soup.find("select", attrs={"id": "selectEpisode"}).find_all("option")
print(f"Found {len(options)} episodes...")


base_url = "https://www.kiss-anime.ws/"
for idx, option in enumerate(options):
    print(f"Downloading {idx+1} of {len(options)}...")
    page_url = option['value']

    try:
        driver.set_page_load_timeout(40)
        driver.get(page_url)
    except TimeoutException:
        print(f"\t{page_url} - Timed out receiving message from renderer")
        continue
    except RemoteDisconnected:
        # RemoteDisconnected means the server closed the connection, not a 404
        print(f"\t{page_url} - Remote end closed the connection without a response.")
        continue

    WebDriverWait(driver, timeout=40).until(EC.presence_of_element_located((By.ID, "divDownload")))
    driver.find_element(By.ID, "divDownload").click()  # find_element_by_id was removed in Selenium 4
    print("\tDownloading...")
    time.sleep(15)


driver.quit()
print("done")

So, firstly, I'm opening the URL in a Chrome browser and reading the dropdown values, which are 808 in total. Then I walk through each of those 808 URLs to fetch the actual link we need to download the video. By calling .click() in the loop, I'm simulating a button click, and the video starts to download. Remember to change time.sleep(x), where x should be the approximate time (in seconds) it takes to download one episode, depending on your internet speed.
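A fixed time.sleep(x) either wastes time or cuts a slow download short. One alternative, assuming Chrome's default behaviour of writing an in-progress download as `<name>.crdownload`, is to poll the download directory until no such files remain. The helper below is a sketch of that idea, not part of the original answer; the `clock` and `sleep` parameters are injectable only so it can be tested quickly.

```python
import glob
import os
import time


def wait_for_downloads(download_dir, timeout=600, poll=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Return True once no .crdownload files remain, False on timeout.

    Chrome names an in-progress download `<name>.crdownload`, so an
    empty glob means every download in `download_dir` has finished.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if not glob.glob(os.path.join(download_dir, "*.crdownload")):
            return True
        sleep(poll)
    return False
```

For this to work you would first point Chrome at a known download directory, e.g. `chrome_options.add_experimental_option("prefs", {"download.default_directory": "/path/to/dir"})` inside get_browser(), and then call wait_for_downloads("/path/to/dir") after each click instead of the fixed sleep.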

You need to install the selenium and bs4 packages using pip install. Also, download chromedriver.exe and make sure it is either on your PATH or in the same directory as this script.
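If you want to sanity-check the dropdown parsing without launching Chrome at all, the same extraction can be done with the standard library's html.parser (handy in environments where bs4 isn't installed). The markup below is a made-up two-episode miniature of the dropdown; the real page's markup may differ.

```python
from html.parser import HTMLParser


class OptionValueParser(HTMLParser):
    """Collect the value attribute of every <option> inside #selectEpisode."""

    def __init__(self):
        super().__init__()
        self.in_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == "selectEpisode":
            self.in_select = True
        elif tag == "option" and self.in_select and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False


# Made-up miniature of the episode dropdown for testing the parser.
html = """
<select id="selectEpisode">
  <option value="https://www.kiss-anime.ws/Anime-detective-conan-1">Episode 1</option>
  <option value="https://www.kiss-anime.ws/Anime-detective-conan-2">Episode 2</option>
</select>
"""

parser = OptionValueParser()
parser.feed(html)
print(parser.values)
```

Feeding it driver.page_source instead of the sample string would give the same list of episode URLs the loop in the answer iterates over.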