2 votes

As part of learning to use Scrapy, I have tried to crawl Amazon, and I'm running into a problem while scraping the data.

The output of my code is as follows:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13',
              u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14',
              u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15',
              u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16',
              u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17',
              u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18',
              u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19',
              u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20',
              u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21',
              u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22',
              u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23',
              u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'],
     'title': [u'ObamaCare Survival Guide',
               u'The Official SAT Study Guide, 2nd edition',
               u'Inferno: A Novel (Robert Langdon)',
               u'A Memory of Light (Wheel of Time)',
               u'Jesus Calling: Enjoying Peace in His Presence',
               u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy',
               u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set',
               u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health',
               u'Publication Manual of the American Psychological Association, 6th Edition',
               u'The One and Only Ivan',
               u'Inquebrantable (Spanish Edition)'],
     'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633',
     'visit_status': 'new'}

But I wanted the output to be captured like this:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'],
     'title': [u'ObamaCare Survival Guide']}

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'],
     'title': [u'The Official SAT Study Guide, 2nd edition']}

I think it's not a problem with Scrapy or the crawler, but with the for loop I've written.

Following is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Amaze.items import AmazeItem

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow= True),
        )

    def parse_items_1(self, response):
        items = []
        print ('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print ('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items

Any assistance would be appreciated!


3 Answers

3 votes

The problem is in your XPath:

def parse_items_1(self, response):
    items = []
    print ('*** response:', response.url)
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        print ('**parse-items_1:', item["title"], item["link"])
        items.append(item)
    return items

In the above XPaths you need the leading . so that each expression is evaluated relative to the current title node. Without it, the XPath searches the whole page, so it matches every title link on the page and returns all of them for every item.
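
Here is a minimal sketch of that difference, using the same HtmlXPathSelector API as the code above (the compare_selectors helper is just an illustration name, and the h3/a markup is assumed from the question):

from scrapy.selector import HtmlXPathSelector

def compare_selectors(response):
    hxs = HtmlXPathSelector(response)
    for h3 in hxs.select('//h3'):
        # '//a[...]' restarts at the document root, so every h3 yields
        # the full list of title links on the page.
        page_wide = h3.select('//a[@class="title"]/text()').extract()
        # './/a[...]' is evaluated relative to this h3 node, so it yields
        # only the title nested inside it.
        this_title = h3.select('.//a[@class="title"]/text()').extract()
        print ('page-wide:', len(page_wide), 'relative:', len(this_title))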

1 vote

By the way, you can test out your XPath expressions in the Scrapy shell: http://doc.scrapy.org/en/latest/topics/shell.html

Done right, it will save you hours of work and a headache. :)
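
For example, something like this (the URL is one of the result pages from the question's log; hxs is the selector object the shell exposes in this Scrapy version, so treat the exact variable name as an assumption for your install):

scrapy shell "http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155"
>>> titles = hxs.select('//h3')
>>> titles[0].select('.//a[@class="title"]/text()').extract()   # relative: just this h3's title
>>> titles[0].select('//a[@class="title"]/text()').extract()    # absolute: every title on the page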

0 votes

Use yield to turn the callback into a generator, and fix your XPath selectors:

def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')

    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()

        yield item
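
With yield the callback becomes a generator, so Scrapy pulls the items one at a time and logs and pipelines each one as it is produced. Combined with the relative .// selectors, that gives one "Scraped from ..." DEBUG line per book, which is the output format asked for in the question.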