0 votes

I've spent a long time trying to figure this out, to no avail. I've read a lot about passing back an HtmlResponse and using Selenium middleware, but have struggled to understand how to structure the code and implement it in my solution.
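As far as I understand it, the middleware approach I keep reading about looks roughly like this (untested on my end; the module path and priority in the settings comment are just placeholders):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    """Downloader middleware that renders pages with Selenium before Scrapy sees them."""

    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # Load the page in Firefox so its JavaScript runs, then hand the
        # rendered HTML back to Scrapy as an HtmlResponse; returning a
        # response here short-circuits Scrapy's own downloader.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

# settings.py (module path is a placeholder):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}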

Here is my spider code:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

count = 0

class ContractSpider(scrapy.Spider):

    name = "contracts"

    def start_requests(self):
        urls = [
            'https://www.contractsfinder.service.gov.uk/Search/Results',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.contractsfinder.service.gov.uk/Search/Results")
        elem2 = self.driver.find_element_by_name("open")
        elem2.click()
        sleep(5)
        elem = self.driver.find_element_by_name("awarded")
        elem.click()
        sleep(5)
        elem3 = self.driver.find_element_by_id("awarded_date")
        elem3.click()
        sleep(5)
        elem4 = self.driver.find_element_by_name("awarded_from")
        elem4.send_keys("01/03/2018")
        elem4.send_keys(Keys.RETURN)
        sleep(5)
        elem5 = self.driver.find_element_by_name("awarded_to")
        elem5.send_keys("16/03/2018")
        elem5.send_keys(Keys.RETURN)
        sleep(5)
        elem6 = self.driver.find_element_by_name("adv_search")
        self.driver.execute_script("arguments[0].scrollIntoView(true);", elem6)
        elem6.send_keys(Keys.RETURN)

    def parse(self, response):
        global count
        count += 1
        strcount = str(count)
        page = self.driver.get(response.url)
        filename = strcount+'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

        for a in response.css('a.standard-paginate-next'):
            yield response.follow(a, callback=self.parse)

The Selenium part is working: Firefox is launched, the various JavaScript interactions take place, and a final page of results is loaded.

The Scrapy part of the code also seems to be working, in that it finds the next button in the Selenium-loaded Firefox webdriver and clicks through (I can see this by watching the Firefox window itself). However, the actual scraping that takes place (saving HTML to my c:\ drive) is fetching the URL 'https://www.contractsfinder.service.gov.uk/Search/Results' separately, without the Selenium-driven JavaScript interactions from the Firefox webdriver.

I think I understand some of the reasons why this isn't working as I want it to. For example, in start_requests I'm referring to the original URL, which means the Selenium-loaded page is never used by the spider. But every time I've tried to create a response from the webdriver, using a wide variety of methods I've read about on Stack Overflow, I've hit a variety of errors because my understanding isn't good enough. I thought I'd post a version where the Selenium and Scrapy elements are each doing something, but please can someone explain and show me the best approach to linking the two together, i.e. once Selenium has finished, take the page loaded in the Firefox webdriver and pass it to Scrapy to do its stuff? Any feedback much appreciated.


2 Answers

2 votes

As you said, Scrapy opens your initial URL, not the page modified by Selenium.

If you want to get the page from Selenium, use driver.page_source.encode('utf-8') (the encoding is not compulsory). You can also wrap it in a Scrapy Selector:

response = Selector(text=driver.page_source.encode('utf-8'))

After that, work with the response as you normally would.
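For example, extracting data from the rendered page might look like this (the CSS selectors here are only placeholders for whatever you actually need):

from scrapy.selector import Selector

sel = Selector(text=driver.page_source)
# use the usual CSS/XPath selectors against the Selenium-rendered HTML
titles = sel.css('div.search-result a::text').extract()
next_href = sel.css('a.standard-paginate-next::attr(href)').extract_first()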

EDIT:

I would try something like this (note: I haven't tested the code):

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

count = 0

class ContractSpider(scrapy.Spider):

    name = "contracts"

    def start_requests(self):
        urls = [
            'https://www.contractsfinder.service.gov.uk/Search/Results',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def __init__(self):
        self.driver = webdriver.Firefox()
        # An implicit wait tells WebDriver to poll the DOM for a certain amount of time
        # when trying to find any element (or elements) not immediately available.
        self.driver.implicitly_wait(5)

    def get_selenium_response(self, url):
        self.driver.get(url)
        elem2 = self.driver.find_element_by_name("open")
        elem2.click()
        elem = self.driver.find_element_by_name("awarded")
        elem.click()
        elem3 = self.driver.find_element_by_id("awarded_date")
        elem3.click()
        elem4 = self.driver.find_element_by_name("awarded_from")
        elem4.send_keys("01/03/2018")
        elem4.send_keys(Keys.RETURN)
        elem5 = self.driver.find_element_by_name("awarded_to")
        elem5.send_keys("16/03/2018")
        elem5.send_keys(Keys.RETURN)
        elem6 = self.driver.find_element_by_name("adv_search")
        self.driver.execute_script("arguments[0].scrollIntoView(true);", elem6)
        elem6.send_keys(Keys.RETURN)
        return self.driver.page_source.encode('utf-8')

    def parse(self, response):
        global count
        count += 1
        strcount = str(count)
        # Here you get the response from the webdriver
        # and can use selectors to extract data from it
        selenium_response = Selector(text=self.get_selenium_response(response.url))
    ...
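For completeness, the rest of parse might look something like this (again untested; the pagination selector and file-saving behaviour are just carried over from the question's code):

        # e.g. save the Selenium-rendered HTML instead of Scrapy's raw response
        filename = strcount + '-quotes.html'
        with open(filename, 'wb') as f:
            f.write(self.driver.page_source.encode('utf-8'))
        self.log('Saved file %s' % filename)

        # follow pagination links found in the rendered page
        for href in selenium_response.css('a.standard-paginate-next::attr(href)').extract():
            yield response.follow(href, callback=self.parse)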
1 vote

Combining the solution from @Alex K and others, here is my tested code:

import scrapy
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

...

def __init__(self, name=None, **kwargs):
    super(MySpider, self).__init__(name, **kwargs)
    self.driver = webdriver.Chrome()

@staticmethod
def get_selenium_response(driver, url):
    driver.get(url)
    # in case of an explicit amount of time:
    # time.sleep(5)
    # in case of waiting until the element has been found:
    try:
        def find(driver):
            table_el = driver.find_element_by_xpath('//*[@id="table_el"]')
            if table_el:
                return table_el
            else:
                return False
        element = WebDriverWait(driver, 5).until(find)
        return driver.page_source.encode('utf-8')
    except Exception:
        driver.quit()

def parse(self, response):
    response = scrapy.Selector(
        text=self.get_selenium_response(self.driver, response.url))
    # ...parse the response as usual
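One small addition that may be worth making here (my own suggestion, not part of the answer above): quit the webdriver when the spider finishes, via Scrapy's closed() hook, assuming the driver is kept on self.driver as in __init__:

def closed(self, reason):
    # Scrapy calls a spider's closed() method when the crawl finishes,
    # so the browser process does not linger afterwards.
    self.driver.quit()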