2 votes
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # click through the pagination until the next link disappears
                next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_link.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                break

        self.driver.quit()

Selenium with Scrapy for dynamic pages

This solution works well, but it requests the same URL twice: once through the Scrapy scheduler and once more through the Selenium web driver.

It takes twice as long to finish the job compared to a Scrapy request without Selenium. How can I avoid this?

Why do you want to use scrapy if you are already using Selenium? – VMRuiz
@VMRuiz Scrapy isn't just about request/response and HTML parsing. It comes with many more capabilities, and the most interesting one is concurrency. – Yash Pokar
In that case, if you only want to render the webpage, you can use scrapy + splash: splash.readthedocs.io/en/stable – VMRuiz
I have used Splash, but I couldn't fetch results for one site, whereas Chrome and Firefox are well-known browsers that give 100% of the results. – Yash Pokar
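
For reference, the scrapy + splash approach mentioned in the comments looks roughly like this. This is a minimal sketch, assuming the scrapy-splash package is installed, a Splash instance is listening on localhost:8050, and the project settings are configured as in the scrapy-splash README; the spider name and URL here are just examples.

import scrapy
from scrapy_splash import SplashRequest  # assumes the scrapy-splash package


class SplashProductSpider(scrapy.Spider):
    name = 'splash_products'  # hypothetical spider name

    def start_requests(self):
        # let Splash render the page before it reaches the spider
        yield SplashRequest(
            'http://www.ebay.com/sch/i.html?_nkw=python',
            self.parse,
            args={'wait': 2},  # give dynamic content time to load
        )

    def parse(self, response):
        # response now contains the rendered HTML
        self.log(response.xpath('//title/text()').get())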

1 Answer

4 votes

Here is a trick that can be useful for solving this problem.

Create a web service that runs Selenium, and run it locally:

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def get_driver():
        # lazily create a single shared headless Chrome instance
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.get_driver()

    def get(self):
        # fetch the requested url with selenium and return the rendered html
        url = str(request.args['url'])

        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)

Now http://127.0.0.1:5000/?url=https://stackoverflow.com/users/5939254/yash-pokar will return the rendered web page, using the Selenium Chrome (or Firefox) driver.
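
A quick way to verify the service (a sketch; it assumes the requests package is installed and the Flask app above is already running):

import requests
from urllib.parse import quote

url = 'https://stackoverflow.com/users/5939254/yash-pokar'
html = requests.get('http://127.0.0.1:5000/?url={}'.format(quote(url))).text
print(html[:200])  # start of the rendered page source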

Now the spider looks like this:

import scrapy
from urllib.parse import quote


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # route every request through the local selenium service
            url = 'http://127.0.0.1:5000/?url={}'.format(quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            # extract from the rendered html as usual
            'field': response.xpath('//td[@class="pagn-next"]/a/@href').get(),
        }
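
Each page is now fetched only once, by the Selenium service; Scrapy itself only requests the local endpoint, so the duplicate download disappears while Scrapy's scheduling and concurrency are preserved. You can run the spider as usual, e.g. scrapy runspider product_spider.py -o items.json (the file name is just an example).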