I'm scraping a list of pages. I have:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
Now, when I do the scraping, if the page exists, the URL changes, so when I try:
response.url
or
response.request
I don't get
'page_1_id', 'page_2_id', 'page_1_2', 'page_3_id'
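(If the URL change comes from redirects, I believe the redirect middleware keeps the chain of URLs in the request meta, so something like the line below might recover the original URL, but I would still rather pass the id explicitly:)

original_url = response.request.meta.get('redirect_urls', [response.url])[0]  # first URL before any redirects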
Since Scrapy makes asynchronous requests, I need the 'id' to match the data back, so I need to pass the 'id' as an argument in each request. I thought of using a list:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
id = ['id_1','id_2','id_3']
But there are two issues: first, I don't know how to pass these arguments, and second, it won't work anyway since I don't know the order in which the requests are made. So I would probably need to use a dictionary. Is there a way to do something like this:
start_urls = {'page_1_id':id_1, 'page_2_id':id_2, 'page_1_3':id_3, 'page_4_id':id_4}
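If it helps, this is roughly what I am imagining, assuming I can override start_requests and attach the id to each request through Request.meta (the URLs and ids here are just placeholders):

import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'

    # hypothetical mapping of start URL -> id (placeholder values)
    url_to_id = {
        'page_1_id': 'id_1',
        'page_2_id': 'id_2',
        'page_3_id': 'id_3',
    }

    def start_requests(self):
        for url, page_id in self.url_to_id.items():
            # carry the id with the request so it comes back with the response
            yield scrapy.Request(url, callback=self.parse, meta={'page_id': page_id})

But I am not sure if meta is the right place for this, or if there is a cleaner way.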
My spider is quite simple; I just need to get a link and the id back:
def parse(self, response):
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract())
    return myItem
I just need to add the 'id':
def parse(self, response):
    myItem = Item(
        link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(),
        id=...,  # <- this is where I need the matching id
    )
    return myItem
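So I suppose the parse would end up roughly like this, assuming the id really does come back on response.meta and that the Item class declares an id field (again, just a guess on my part):

def parse(self, response):
    myItem = Item(
        link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(),
        id=response.meta['page_id'],
    )
    return myItem

Is this the right approach, or is there a better way to pass the id along with each request?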