3
votes

I tried to crawl a local HTML file stored on my desktop with the code below, but I encounter the errors shown below before the crawl starts, such as "No such file or directory: '/robots.txt'".

  • Is it possible to crawl local HTML files on a local computer (Mac)?
  • If it is possible, how should I set parameters like "allowed_domains" and "start_urls"?

[Scrapy command]

$ scrapy crawl test -o test01.csv

[Scrapy spider]

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

[Errors]

2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
Hi @Baka, I'm glad that the issue is now resolved. FYI, I've just rolled back your last edit to the question. Reason: fixing your question and making it the "correct" version would confuse future readers, especially those who have a similar issue and are seeking help. - starrify
@starrify, I agree with your opinion; you are right. Thank you for keeping my question valuable :-) - Baka

1 Answer

2
votes

When working with Scrapy locally, I never specify allowed_domains. Try taking that line out and see if it works.

In your error log, it's testing the 'empty' domain that you have given it.
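
For reference, here is a minimal sketch of the spider with allowed_domains removed. The spider name and file path come from the question; the parse callback and the title selector are just illustrative placeholders:

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # no allowed_domains: a file:// URL has no hostname to match against
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

    def parse(self, response):
        # example extraction only; adjust the selector to your page
        yield {'title': response.css('title::text').extract_first()}

The robots.txt retries in your log come from Scrapy's robots middleware trying to fetch file:///robots.txt before crawling; if they persist, setting ROBOTSTXT_OBEY = False in settings.py disables that check.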