
I'm using the SgmlLinkExtractor functionality in scrapy to extract specific URLs.

I override the start_requests method to crawl dynamic URLs.

This looks like:

    def start_requests(self):
        ...
        yield Request(url.strip(), callback=self.callbackA)

callbackA does nothing right now.

I also implemented process_value for the SgmlLinkExtractor, but it is never called.

This is the rule I'm using:

rules = [Rule(SgmlLinkExtractor(allow=()), callback=callbackB, follow=True),]

Again, callbackB is never called.

Welcome to Stack Overflow! Can you explain what you've tried so far? – Jeff Tratner

This is what I want to achieve: I want to scan all a and href tags on a site and apply some logic to decide whether to follow each URL. The logic is as follows: check whether the link or the link description contains career or job (case-insensitive); if so, create a link to parse it. The link may contain ../, so those need to be stripped. – DjangoPy

Okay, but what regular expression have you tried building already? Has your bot worked? If not, what error message did you get? Can you post a (small, very minimal!) example of input you want to parse and your expected output? – Jeff Tratner

I don't have the regular expression. My idea was to use SgmlLinkExtractor to allow all links and pass a callback function for process_value that will do all the logic. What do you think? – DjangoPy

Sorry, you're right, I misread. However, you should post some code that you've tried so far and examples of input and output (so it's actually possible to do with scrapy). That's the best way to get help on SO. – Jeff Tratner
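The filtering logic described in the comments can be sketched without scrapy at all. This is only an illustrative helper (the function name wanted and the keyword set are my assumptions, not from the question): keep a link if its URL or anchor text mentions "career" or "job" (case-insensitive), and resolve any ../ segments.

```python
import re
from posixpath import normpath
from urllib.parse import urljoin, urlparse, urlunparse

# Hypothetical helper sketching the logic from the comments:
# accept a link only if the href or the link text matches the
# keywords (case-insensitive), then normalise "../" segments.
KEYWORDS = re.compile(r'career|job', re.IGNORECASE)

def wanted(base_url, href, link_text):
    if not (KEYWORDS.search(href) or KEYWORDS.search(link_text)):
        return None
    absolute = urljoin(base_url, href)    # resolves "../" against the base URL
    parts = urlparse(absolute)
    clean_path = normpath(parts.path)     # collapse any leftover "../" in the path
    return urlunparse(parts._replace(path=clean_path))

print(wanted("http://example.com/a/b/", "../careers.html", "Jobs"))
# prints "http://example.com/a/careers.html"
```

A function like this could then be passed as process_value, which SgmlLinkExtractor calls on each extracted URL (returning None drops the link).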

1 Answer


If your callbacks are declared as methods on your spider, they do not have global scope, so callback=callbackB fails with a NameError. Reference them through the spider instead: Rule accepts the method name as a string, which scrapy resolves on the spider instance at runtime:

rules = [
  Rule(SgmlLinkExtractor(), callback='callbackB', follow=True),
]