
I am a beginner with Scrapy and Python. I tried to deploy my spider to Scrapinghub and encountered the error shown further down. Here is the code.

import scrapy
from bs4 import BeautifulSoup,SoupStrainer
import urllib2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import re
import pkgutil
from pkg_resources import resource_string
from tues1402.items import Tues1402Item

data = pkgutil.get_data("tues1402","resources/urllist.txt")
class SpiderTuesday (scrapy.Spider):     
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]
    def parse(self, response):
       story = Tues1402Item()
       story['url'] = response.url
       story['title'] = response.xpath("//title/text()").extract()
       return story

This is my spider.py code.

import scrapy
class Tues1402Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field() 

This is the items.py code, and

from setuptools import setup, find_packages
setup(
    name         = 'tues1402',
    version      = '1.0',
    packages     = find_packages(),
    entry_points = {'scrapy': ['settings = tues1402.settings']},
    package_data = {'tues1402':['resources/urllist.txt']},
    zip_safe = False,
)

this is the setup.py code.

The following is the error.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 126, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h

Thank you in advance.


1 Answer


The error means that Scrapy was asked to request the URL h, which is not valid because it has no scheme such as http://. Print self.start_urls to see what it actually contains; most likely its first entry is the single character h.

It looks like your spider iterates over the raw text of the file, one character at a time, instead of over a list of URLs here:

data = pkgutil.get_data("tues1402","resources/urllist.txt")
class SpiderTuesday (scrapy.Spider):     
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]
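
To see why the first URL comes out as h, here is a minimal sketch. The example.com addresses are placeholders I made up, since the real contents of urllist.txt are not shown in the question:

```python
# Hypothetical file contents; the real urllist.txt is not shown in the question.
data = "http://example.com/a\nhttp://example.com/b\n"

# Iterating over a string yields one character at a time,
# so the first "URL" is just the letter h.
per_char = [url.strip() for url in data]

# Splitting on line breaks first yields one entry per line.
per_line = [url.strip() for url in data.splitlines()]

print(per_char[:4])
print(per_line)
```

The first list holds single characters, which is what Scrapy then tries to turn into requests.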

Assuming urllist.txt stores the URLs with some separator, you need to split the text on that separator first:

# assuming the file has one URL per line
self.start_urls = [url.strip() for url in data.splitlines()]
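
If the file may contain blank lines, or if you later run this on Python 3 (where pkgutil.get_data returns bytes rather than str), a slightly more defensive version could look like the sketch below. The helper name parse_url_list and the UTF-8 encoding are my assumptions, not something from your project:

```python
def parse_url_list(raw):
    """Return the non-empty, stripped lines of raw as a list of URL strings."""
    # pkgutil.get_data returns bytes on Python 3 (a plain str on Python 2).
    # Assumption: the file is UTF-8 encoded.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    # Split into lines, trim whitespace, and drop empty lines.
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Hypothetical sample standing in for the real file contents.
sample = b"http://example.com/a\r\n\r\n  http://example.com/b  \n"
urls = parse_url_list(sample)
print(urls)
```

In the spider you would then write start_urls = parse_url_list(data) in the class body. Note that self is not defined at class level, so the attribute there is assigned without the self. prefix; self.start_urls only works inside a method such as __init__.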