
I'm scraping a list of pages. I have:

start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']

Now, when I do the scraping, if the page exists the URL changes, so when I try:

response.url 

or

response.request

I don't get back

'page_1_id', 'page_2_id', 'page_1_2', 'page_3_id'

Since Scrapy makes asynchronous requests, I need the 'id' to match the data back, so what I need is to pass the 'id' as an argument in each request. I thought of a list:

start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']

id = ['id_1','id_2','id_3']

But that has two issues: first, I don't know how to pass these arguments, and second, it won't work since I don't know the order in which the requests are made. So I would probably need to use a dictionary. Is there a way to make something like this:

start_urls = {'page_1_id':id_1, 'page_2_id':id_2, 'page_1_3':id_3, 'page_4_id':id_4}

My spider is quite simple; I just need to get the link and the id back:

def parse(self, response):
    # extract the link from the page
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract())
    return myItem

I just need to add the 'id':

def parse(self, response):
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(), id=id)
    return myItem

1 Answer


You can change how Scrapy starts yielding requests by overriding the start_requests() method. It seems like you want to do that and then put the id in the request.meta attribute to carry it over to the parse callback. Something like:

start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']

def start_requests(self):
    for url in self.start_urls:
        # carry the id alongside the request via meta
        yield scrapy.Request(url,
                             meta={'page_id': url.split('_', 1)[-1]})  # e.g. '1_id'

def parse(self, response):
    print(response.meta['page_id'])
    # 1_id
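
If you would rather keep the explicit mapping of URLs to ids that you proposed, instead of deriving the id from the URL, you can iterate over a dict in start_requests() and put the mapped id into meta. Here is a minimal sketch; the url_to_id name, the spider name, and the 'link'/'id' fields on your Item class are assumptions based on your snippets:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'  # assumed spider name

    # explicit mapping of page URLs to the ids you want to get back
    url_to_id = {'page_1_id': 'id_1', 'page_2_id': 'id_2', 'page_3_id': 'id_3'}

    def start_requests(self):
        for url, page_id in self.url_to_id.items():
            # carry the id with the request so it survives the asynchronous scheduling
            yield scrapy.Request(url, meta={'page_id': page_id})

    def parse(self, response):
        # Item is assumed to have 'link' and 'id' fields
        return Item(
            link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(),
            id=response.meta['page_id'],
        )

Since each request carries its own meta dict, the id always ends up on the item built from that request's response, regardless of the order in which the responses come back.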