I've spent a long time trying to figure this out, to no avail. I've read a lot about passing back an HtmlResponse and about using a Selenium downloader middleware, but I've struggled to understand how to structure that code and work it into my solution.
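For reference, the middleware pattern I keep running into looks roughly like the sketch below. This is only my reconstruction from various answers, not verified working code, and the SeleniumMiddleware class name and wiring are my guesses at how the pieces are meant to fit together:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    # Hypothetical downloader middleware (my reconstruction): render the
    # page in the webdriver, then hand Scrapy an HtmlResponse so Scrapy
    # never downloads the URL itself.
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response from process_request short-circuits
        # Scrapy's own downloader for this request.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

As far as I can tell this would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}, where the module path and priority number are placeholders. I couldn't get anything along these lines working, though.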
Here is my spider code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

count = 0

class ContractSpider(scrapy.Spider):
    name = "contracts"

    def start_requests(self):
        urls = [
            'https://www.contractsfinder.service.gov.uk/Search/Results',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def __init__(self):
        # Drive the search form: tick the "open" and "awarded" filters,
        # fill in the awarded-date range, then submit the advanced search.
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.contractsfinder.service.gov.uk/Search/Results")
        elem2 = self.driver.find_element_by_name("open")
        elem2.click()
        sleep(5)
        elem = self.driver.find_element_by_name("awarded")
        elem.click()
        sleep(5)
        elem3 = self.driver.find_element_by_id("awarded_date")
        elem3.click()
        sleep(5)
        elem4 = self.driver.find_element_by_name("awarded_from")
        elem4.send_keys("01/03/2018")
        elem4.send_keys(Keys.RETURN)
        sleep(5)
        elem5 = self.driver.find_element_by_name("awarded_to")
        elem5.send_keys("16/03/2018")
        elem5.send_keys(Keys.RETURN)
        sleep(5)
        elem6 = self.driver.find_element_by_name("adv_search")
        self.driver.execute_script("arguments[0].scrollIntoView(true);", elem6)
        elem6.send_keys(Keys.RETURN)

    def parse(self, response):
        global count
        count += 1
        strcount = str(count)
        # driver.get() returns None, so 'page' ends up as None in the filename
        page = self.driver.get(response.url)
        filename = strcount + 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        for a in response.css('a.standard-paginate-next'):
            yield response.follow(a, callback=self.parse)
The Selenium part is working, in that Firefox is launched, the various JavaScript interactions take place, and a final page of results is loaded.
The Scrapy part also seems to be working, in that it finds the next button on the Selenium-loaded page and clicks through (I can see this happening by watching the Firefox webdriver itself). However, the actual scraping that takes place (saving HTML down onto my C:\ drive) is fetching 'https://www.contractsfinder.service.gov.uk/Search/Results' separately, without any of the Selenium-driven JavaScript interactions from the Firefox webdriver.
I think I understand some of the reasons why this isn't working as I want it to. For example, start_requests refers to the original URL, which means the Selenium-loaded page is never used by the spider. But every time I've tried to create a response back from the webdriver, using a wide variety of methods I've read about on Stack Overflow, I get a variety of errors, as my understanding isn't good enough. I thought I'd post a version where the Selenium and Scrapy elements are each doing something. Can someone please explain and show me the best approach to linking the two together, i.e. once Selenium has finished, pass the page loaded in the Firefox webdriver to Scrapy to do its stuff? Any feedback much appreciated.
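To make the question concrete: what I imagine the answer looks like, based on my reading so far, is something along the lines of the sketch below, where parse() wraps driver.page_source in an HtmlResponse and scrapes that instead of the response Scrapy downloaded separately. This is only my understanding of the approach, not code I have working:

from scrapy.http import HtmlResponse

def parse(self, response):
    # Build a response out of whatever the webdriver is actually
    # displaying, rather than the page Scrapy fetched on its own.
    selenium_response = HtmlResponse(
        url=self.driver.current_url,
        body=self.driver.page_source,
        encoding='utf-8',
    )
    # ...then run the existing css()/follow() logic against
    # selenium_response instead of response...
    for a in selenium_response.css('a.standard-paginate-next'):
        yield selenium_response.follow(a, callback=self.parse)

Is that the right general shape, or is the downloader-middleware route the better way to link the two?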