1 vote

I have a scraper that works perfectly fine when I call it from the command line, like:

scrapy crawl generic

This is how my spider looks:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass

I'm trying to use this spider from a Python script, and I followed the documentation: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

This is what the script looks like:

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass


settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()

This is what I see in the terminal when I run the script:

Desktop $ python newspider.py  
2015-10-14 21:46:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-14 21:46:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 21:46:39 [scrapy] INFO: Overridden settings: {'DEPTH_LIMIT': 1}
2015-10-14 21:46:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 21:46:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 21:46:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 21:46:39 [scrapy] INFO: Enabled item pipelines: 
2015-10-14 21:46:39 [scrapy] INFO: Spider opened
2015-10-14 21:46:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 21:46:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 21:46:39 [scrapy] DEBUG: Redirecting (302) to <GET http://thevine.com.au/> from <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'thevine.com.au': <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET http://www.twitter.com/thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?u=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/intent/tweet?text=Leonardo+DiCaprio+is+Producing+A+Movie+About+The+Volkswagen+Emissions+Scandal&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F&via=thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET http://plus.google.com/share?url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'pinterest.com': <GET http://pinterest.com/pin/create/button/?media=http%3A%2F%2Fs3-ap-southeast-2.amazonaws.com%2Fthevine-online%2Fwp-content%2Fuploads%2F2015%2F10%2F13202447%2FScreen-Shot-2015-10-14-at-7.24.25-AM.jpg&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] INFO: Closing spider (finished)
2015-10-14 21:46:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 28536,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 14, 16, 16, 41, 270707),
 'log_count/DEBUG': 10,
 'log_count/INFO': 7,
 'offsite/domains': 7,
 'offsite/filtered': 139,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 10, 14, 16, 16, 39, 454120)}

In this case, the start URL was http://thevine.com.au/ and allowed_domains was thevine.com.au.
The same start URL and domain, when given to the spider running inside a Scrapy project, give this:

$ scrapy crawl generic -a start="http://thevine.com.au/" -a domains="thevine.com.au"
2015-10-14 22:14:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: mary)
2015-10-14 22:14:45 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 22:14:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mary.spiders', 'SPIDER_MODULES': ['mary.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'mary'}
2015-10-14 22:14:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 22:14:46 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 22:14:46 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 22:14:46 [scrapy] INFO: Enabled item pipelines:
2015-10-14 22:14:46 [scrapy] INFO: Spider opened
2015-10-14 22:14:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 22:14:46 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 22:14:47 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 22:14:47 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
.
.
2015-10-14 22:14:48 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/category/entertainment/> (referer: http://thevine.com.au/)

2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/ 
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/viral/
.
.

2015-10-14 22:16:10 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/gear/tech/elon-musk-plans-to-launch-4000-satellites-to-bring-wi-fi-to-most-remote-locations-on-earth/> (referer: http://thevine.com.au/)  
2015-10-14 22:19:31 [scrapy] INFO: Crawled 26 pages (at 16 pages/min), scraped 0 items (at 0 items/min)

and so on, it just keeps going.

So, basically, this is what I understand about what happens when I run from the script:
the Rule is not followed at all, and my parse_item callback is never called. Any callback other than the default parse doesn't work. The spider only crawls the URLs in start_urls and only calls the default parse method, if one is defined.

Comments:
"you are passing ImgurSpider instead of MySpider" – eLRuLL
"sorry, that was a mistake. I had changed ImgurSpider to MySpider for the question and forgot to change it in that line. Corrected the question now." – Jeff P Chacko

1 Answer

2 votes

You need to pass an instance of the spider class to the .crawl method:

...
spider = MySpider()
process.crawl(spider)
...

That said, it should still work the way you are doing it.

The logs show that you are making offsite requests. Try removing allowed_domains from the spider (if you don't care about it), or you could pass the domain to process.crawl:

process.crawl(spider, domain="mydomain")
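Passing domain (or the start/domains names your project spider already accepts via -a) only helps if the spider actually uses those arguments to build start_urls and allowed_domains in its __init__. Below is a minimal sketch of that idea; the keyword names start and domains are taken from your scrapy crawl command and are otherwise an assumption about how your project spider handles them:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings


class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)

    def __init__(self, start=None, domains=None, *args, **kwargs):
        # build start_urls/allowed_domains from the arguments instead of hard-coding them
        # (these keyword names mirror the -a start=... -a domains=... arguments above)
        self.start_urls = [start] if start else []
        self.allowed_domains = [domains] if domains else []
        super(MySpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass


settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
# keyword arguments to crawl() are forwarded to the spider's constructor
process.crawl(MySpider, start="http://thevine.com.au/", domains="thevine.com.au")
process.start()

That way allowed_domains matches the site you are actually crawling, so the offsite middleware should stop dropping the links the Rule extracts.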