
I recently made a web scraper with Python and Selenium, and I found it pretty simple to do. The page used AJAX calls to load the data, and initially I waited a fixed timeout for the page to load. That worked for a while. Later I found that Selenium has a built-in class, WebDriverWait, which can wait for a specific element to load using wait.until(). This made my web scraper run faster.
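To illustrate the difference between the fixed timeout and the explicit wait, here is a minimal standard-library sketch of the polling idea. It is not Selenium's actual code: `wait_until`, `content_loaded`, and the `timeout` and `poll` values are names and numbers made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper that mimics the idea of an explicit wait:
    return as soon as the condition holds instead of always sleeping
    for a fixed amount of time.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for "the AJAX content has arrived": becomes true after about 0.2 s.
ready_at = time.monotonic() + 0.2


def content_loaded():
    return time.monotonic() >= ready_at


start = time.monotonic()
wait_until(content_loaded, timeout=5.0, poll=0.05)
elapsed = time.monotonic() - start  # close to 0.2 s, far below the 5 s limit
```

With a fixed sleep I would always pay the full timeout, whereas here the call returns shortly after the condition first becomes true.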

The problem is, I was still not satisfied with the results. It took an average of 1.35 seconds per page to download the content.

I tried to parallelize this, but the times did not get better because creating the driver instance (with Chrome or PhantomJS) took most of the scraping time.

So I turned to Scrapy. After doing the tutorials, and having my parser already written, my three questions are:

1) Does Scrapy automatically run multiple URL requests in parallel?

2) How can I set a dynamic timeout with Scrapy, like Selenium's WebDriverWait wait.until()?

3) If there is no dynamic timeout available in Scrapy, and the solution is to use Scrapy + Selenium and let Selenium wait until the content is loaded, is there really any advantage to using Scrapy? I could simply retrieve the data using Selenium selectors, like I was doing before.

Thank you for your help.


1 Answer

  1. Yes. Scrapy schedules and processes requests asynchronously, so it can have several requests in flight at once. It does not need to wait for one request to be finished and processed before sending another or doing other work. The level of concurrency is configurable through settings such as CONCURRENT_REQUESTS.

  2. Scrapy itself does not execute JavaScript, so for this you need a renderer. Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. Used together with Scrapy, it lets you work with dynamic content much as you would with Selenium. By default Splash waits for all remote resources to load, but in most cases it is better not to wait for them forever. To abort resource loading after a timeout and give the whole page a chance to render, set a resource timeout with either splash.resource_timeout or request:set_timeout.

  3. Again, the big difference in my experience is the speed of the scraping process. Because Scrapy handles requests asynchronously, it does not pay the cost of one blocking browser session per page, which gives it a big advantage over a Selenium-only approach.
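
As a rough sketch of points 1 and 2, the fragment below shows what a project's settings.py and a Splash script might look like. The setting names CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_TIMEOUT are real Scrapy settings, and SPLASH_URL is the setting read by the scrapy-splash plugin. Everything else is an assumption for illustration: the numeric values, the Splash address, the names LUA_WAIT_FOR_SELECTOR and SPLASH_ARGS, and the `selector` and `max_wait` script arguments.

```python
# settings.py (fragment) -- values are illustrative, tune them for your target site.

# Point 1: how many requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 32             # global limit (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain limit (Scrapy's default is 8)
DOWNLOAD_TIMEOUT = 30                # seconds before the downloader gives up

# Point 2: address of a running Splash instance (assumed to be local here).
SPLASH_URL = "http://localhost:8050"

# Lua script for Splash's `execute` endpoint. It polls for a CSS selector,
# which is the closest analogue to WebDriverWait's wait.until().
# `args.selector` and `args.max_wait` are custom arguments invented for this sketch.
LUA_WAIT_FOR_SELECTOR = """
function main(splash, args)
  splash.resource_timeout = 10        -- stop waiting for slow resources
  assert(splash:go(args.url))
  local waited = 0
  while not splash:select(args.selector) do
    if waited >= args.max_wait then
      error("timed out waiting for " .. args.selector)
    end
    splash:wait(0.1)
    waited = waited + 0.1
  end
  return splash:html()
end
"""

# Arguments that would be sent along with each rendered request (hypothetical values).
SPLASH_ARGS = {
    "lua_source": LUA_WAIT_FOR_SELECTOR,
    "selector": "div.results",  # placeholder for the element loaded via AJAX
    "max_wait": 10,             # seconds to poll before giving up
    "timeout": 30,              # overall Splash script timeout in seconds
}
```

In a spider, a dictionary like SPLASH_ARGS would typically be passed as the `args` of a scrapy-splash SplashRequest with `endpoint="execute"`. Polling inside the script means each page waits only as long as its content actually takes to appear.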