
I'm quite new to web scraping. I'm trying to crawl pages after successfully logging in to the quotes.toscrape.com website. My code (scrapytest/spiders/quotes_spider.py) is as follows:

import scrapy
from scrapy.http import FormRequest
from ..items import ScrapytestItem
from scrapy.utils.response import open_in_browser
from scrapy.spiders.init import InitSpider


class QuoteSpider(scrapy.Spider):
    name = 'scrapyquotes'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self,response):
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        yield scrapy.FormRequest(url=self.login_url,formdata={
            'csrf_token':token,
            'username':'roberthng',
            'password':'dsadsadsa'
        },callback = self.start_scraping)

    def start_scraping(self,response):
        items = ScrapytestItem()
        all_div_quotes=response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items

        #Go to Next Page:     
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Whenever I run this code with $ scrapy crawl scrapyquotes in the terminal (VS Code), the spider manages to log in and scrape the first page, but it always fails to crawl the second page. Below is the error message that appears:

2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

I suspected that this had something to do with the start_urls; however, when I changed it to 'http://quotes.toscrape.com/page/1', the code didn't even scrape the first page. Can anyone help me work this code out? Thank you in advance!

Full Error Log:

2020-10-10 12:26:40 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapytest)
2020-10-10 12:26:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0 
2020-10-10 12:26:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-10 12:26:40 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapytest',
 'NEWSPIDER_MODULE': 'scrapytest.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['scrapytest.spiders']}
2020-10-10 12:26:40 [scrapy.extensions.telnet] INFO: Telnet Password: 92d2fd08391e76a9
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapytest.pipelines.ScrapytestPipeline']
2020-10-10 12:26:40 [scrapy.core.engine] INFO: Spider opened
2020-10-10 12:26:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-10 12:26:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2020-10-10 12:26:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2020-10-10 12:26:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['change', 'deep-thoughts', 'thinking', 'world'],
 'title': ['“The world as we have created it is a process of our thinking. It '
           'cannot be changed without changing our thinking.”']}
2020-10-10 12:26:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['J.K. Rowling'],
 'tag': ['abilities', 'choices'],
 'title': ['“It is our choices, Harry, that show what we truly are, far more '
           'than our abilities.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 'title': ['“There are only two ways to live your life. One is as though '
           'nothing is a miracle. The other is as though everything is a '
           'miracle.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Jane Austen'],
 'tag': ['aliteracy', 'books', 'classic', 'humor'],
 'title': ['“The person, be it gentleman or lady, who has not pleasure in a '
           'good novel, must be intolerably stupid.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Marilyn Monroe'],
 'tag': ['be-yourself', 'inspirational'],
 'title': ["“Imperfection is beauty, madness is genius and it's better to be "
           'absolutely ridiculous than absolutely boring.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['adulthood', 'success', 'value'],
 'title': ['“Try not to become a man of success. Rather become a man of '
           'value.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['André Gide'],
 'tag': ['life', 'love'],
 'title': ['“It is better to be hated for what you are than to be loved for '
           'what you are not.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Thomas A. Edison'],
 'tag': ['edison', 'failure', 'inspirational', 'paraphrased'],
 'title': ["“I have not failed. I've just found 10,000 ways that won't work.”"]}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Eleanor Roosevelt'],
 'tag': ['misattributed-eleanor-roosevelt'],
 'title': ['“A woman is like a tea bag; you never know how strong it is until '
           "it's in hot water.”"]}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Steve Martin'],
 'tag': ['humor', 'obvious', 'simile'],
 'title': ['“A day without sunshine is like, you know, night.”']}
2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
Traceback (most recent call last):
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\Documents\Demos\vstoolbox\scrapytest\scrapytest\spiders\quotes_spider.py", line 15, in parse
    yield scrapy.FormRequest(url=self.login_url,formdata={
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 104, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got NoneType
2020-10-10 12:26:42 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-10 12:26:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1832,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 8041,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 2.063919,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 10, 5, 26, 42, 486494),
 'item_scraped_count': 10,
 'log_count/DEBUG': 15,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 4,
 'robotstxt/request_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2020, 10, 10, 5, 26, 40, 422575)}
2020-10-10 12:26:42 [scrapy.core.engine] INFO: Spider closed (finished)

Code for other files in the directory:

(scrapytest/items.py)

import scrapy


class ScrapytestItem(scrapy.Item):
    # define the fields for your item here:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

(scrapytest/pipelines.py)

from itemadapter import ItemAdapter
import sqlite3


class ScrapytestPipeline(object):        
    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect('myquotes.db')
        self.curr = self.conn.cursor()
    
    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS quotes_tb""")
        self.curr.execute("""create table quotes_tb(
                            title text,
                            author text,
                            tag text
                            )""")

    def process_item(self, item, spider):
        self.store_db(item)
        #print("Pipeline :" + item['title'][0])
        return item

    def store_db(self, item):
        self.curr.execute("""insert into quotes_tb values(?,?,?)""",(
            item['title'][0],
            item['author'][0],
            item['tag'][0]
        ))
        self.conn.commit()

(scrapytest/settings.py)

BOT_NAME = 'scrapytest'

SPIDER_MODULES = ['scrapytest.spiders']
NEWSPIDER_MODULE = 'scrapytest.spiders'
ITEM_PIPELINES = {
    'scrapytest.pipelines.ScrapytestPipeline': 300,
}
Comments:
Can you show the rest of your code and the full error logs? - YukiShioriii
@YukiShioriii Done, edited in my original post. - freudslipper
Your first code snippet ended at "if next_page is not None:"; I feel like you're doing something wrong in the next-page yielding part. - YukiShioriii
Hey @YukiShioriii, sorry, I missed that part. It should be "yield response.follow(next_page, callback=self.parse)"; I have updated it in the original post. - freudslipper

2 Answers

1 vote

You're passing the wrong function as the callback: your self.parse function can only work with the login page, so the next-page request has to go back to start_scraping instead.

if next_page is not None:
    yield response.follow(next_page, callback=self.start_scraping)
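
For reference, here is a minimal sketch of the whole spider with that one change applied (same selectors, item fields, and credentials as in the question; the item is also constructed inside the loop so that each quote gets its own item):

import scrapy
from ..items import ScrapytestItem


class QuoteSpider(scrapy.Spider):
    name = 'scrapyquotes'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self, response):
        # Only the login page has the csrf_token input, so this callback
        # must not be reused for the quote pages.
        token = response.css('input[name="csrf_token"]::attr(value)').get()
        yield scrapy.FormRequest(
            url=self.login_url,
            formdata={
                'csrf_token': token,
                'username': 'roberthng',
                'password': 'dsadsadsa',
            },
            callback=self.start_scraping,
        )

    def start_scraping(self, response):
        for quote in response.css('div.quote'):
            # A fresh item per quote, instead of reusing one instance.
            item = ScrapytestItem()
            item['title'] = quote.css('span.text::text').extract()
            item['author'] = quote.css('.author::text').extract()
            item['tag'] = quote.css('.tag::text').extract()
            yield item

        # Re-enter this same callback for every following page.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.start_scraping)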
0 votes

This is from your execution logs:

  File "C:\Users\Robert\Documents\Demos\vstoolbox\scrapytest\scrapytest\spiders\quotes_spider.py", line 15, in parse
    yield scrapy.FormRequest(url=self.login_url,formdata={
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 104, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got NoneType

In short, it's telling you that one of the values in your formdata parameter is None, but it's expected to be "a str or bytes object". Given that your formdata has three fields and only one of them is a variable, token must be None.

    ...
    token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
    yield scrapy.FormRequest(url=self.login_url,formdata={
        'csrf_token':token,
        'username':'roberthng',
        'password':'dsadsadsa'
    },callback = self.start_scraping)
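
You can reproduce the failure in isolation; constructing the request is enough, no crawl needed (a minimal sketch with placeholder credentials):

from scrapy.http import FormRequest

# Raises: TypeError: to_bytes must receive a str or bytes object, got NoneType,
# because one of the formdata values is None.
FormRequest(
    url='http://quotes.toscrape.com/login',
    formdata={'csrf_token': None, 'username': 'user', 'password': 'pass'},
)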

However, your selector does return a value correctly when you are on the login page. My hypothesis is that when you define the request for the next page, you are setting the callback to your parse method (or not setting it at all, in which case parse is the default). I say hypothesis because you didn't post that part of your code; your code sample stops here:

    #Go to Next Page:     
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:

So make sure that whatever comes after this sets the callback of the request correctly.
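
One way to make this kind of mistake fail loudly (a defensive sketch, not strictly needed once the callback is fixed) is to bail out of parse whenever the token selector finds nothing:

def parse(self, response):
    token = response.css('input[name="csrf_token"]::attr(value)').get()
    if token is None:
        # Not on the login page (or the form changed): log a clear message
        # instead of letting FormRequest raise a TypeError deep in _urlencode.
        self.logger.error('csrf_token not found on %s', response.url)
        return
    yield scrapy.FormRequest(
        url=self.login_url,
        formdata={
            'csrf_token': token,
            'username': 'roberthng',
            'password': 'dsadsadsa',
        },
        callback=self.start_scraping,
    )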