
    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//need_data',
            deny=deny_urls), callback='parse_info'),
        Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
    )

These rules extract the URLs I need for scraping, right? Inside the callback, can I get the URL we navigated from?

For example, the website is needdata.com.

    Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),

extracts URLs like needdata.com/need/1, right? And

    Rule(LinkExtractor(
        restrict_xpaths='//need_data',
        deny=deny_urls), callback='parse_info'),

extracts the URLs found on needdata.com/need/1 (for example, it contains a table of people), and then parse_info scrapes them. Right?
But in parse_info, how do I find out which page is the parent?
If needdata.com/need/1 contains a link to needdata.com/people/1,
I want to add a parent column to my output file whose value is needdata.com/need/1.
How can I do that? Thank you very much.

1 Answer


You want to use

lx = LinkExtractor(allow=(r'shop-online/',))

And then

for l in lx.extract_links(response):
    # l.url is the extracted URL

And then pass the extra data along with the request via

meta={'category': category}

I haven't found a better solution.