
I want to run Scrapy as a Python script, but I cannot figure out how to set the settings correctly or how to provide them. I'm not sure whether it's a settings issue, but I assume it is.

My config:

  • Python 2.7 x86 (as virtual environment)
  • Scrapy 1.2.1
  • Win 7 x64

I followed the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running. I have some issues with the following passage:

If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.

So what is meant with "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.

Here's the code of myScrapy.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse


#Initialization of directories
projectDir = os.path.dirname(os.path.realpath(__file__))  # __file__ without quotes; the string '__file__' resolves relative to the CWD, not the script
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

#Stripping of given URL to get only the host + TLD
if urlToScan.startswith("https://"):
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
elif urlToScan.startswith("http://"):  # elif, otherwise an https URL would also run this branch and overwrite the result
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan,]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = link.url  # the original stored response.url, producing identical items for every link
            yield item  # yield each item; a "return item" after another return was unreachable
        # no need to call CrawlSpider.parse() from a callback; the Rule with follow=True already continues the crawl

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start() # the script will block here until the crawling is finished

Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think it's an issue with getting the settings as they are set within a "normal" Scrapy project (where you have to run scrapy crawl xyz), because the output says 2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}. I hope you understand my question(s) (English isn't my native language... ;)) Thanks in advance!


1 Answer


When running a crawl with a script (and not scrapy crawl), one of the options is indeed to use CrawlerProcess.

So what is meant with "inside a Scrapy project"?

What is meant is if you run your scripts at the root of a scrapy project created with scrapy startproject, i.e. where you have the scrapy.cfg file with the [settings] section among others.
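For reference, scrapy startproject generates a layout like the following (the name myproject is just an example); get_project_settings() only picks up your project settings when the script is run from the directory containing scrapy.cfg:

```ini
# scrapy.cfg, at the project root created by "scrapy startproject myproject";
# this is the file get_project_settings() looks for, and the [settings]
# section tells Scrapy which module holds the project settings
[settings]
default = myproject.settings
```

Without this file on the path, get_project_settings() returns the defaults only, which is why your log shows Overridden settings: {}.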

Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?

Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:

Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it

I don't know that part of the framework in detail, but I suspect that with a spider name only (I believe you meant process.crawl("linkspider"), i.e. the name as a string), and outside of a Scrapy project, Scrapy does not know where to look for spiders (it has no hint). Hence, to tell Scrapy which spider to run, you might as well give the class directly (and not an instance of a spider class).

get_project_settings() is a helper, but essentially, CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess)

In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

So depending on what settings you need to override compared to scrapy defaults, you need to do something like:

process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...