
I am trying to scrape some information from flipkart.com, and for this purpose I am using Scrapy. The information I need is for every product on Flipkart.

I have used the following code for my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector

from tutorial.items import TutorialItem


class WebCrawler(CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/store-directory']
    rules = [
        # rule 1: product pages, scraped by parse_flipkart
        Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', follow=True),
        # rule 2: category pages, only followed for more links
        Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
    ]

    @staticmethod
    def parse_flipkart(response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
        yield item

My intent is to crawl through every product category page (matched by the second rule) and follow the product pages (first rule) within each category page, in order to scrape data from the product pages.

  1. One problem is that I cannot find a way to control the crawling and scraping.
  2. Second, Flipkart uses AJAX on its category pages and displays more products when the user scrolls to the bottom.
  3. I have read other answers and gathered that Selenium might help solve the issue, but I cannot find a proper way to fit it into this structure.

Suggestions are welcome. :)

ADDITIONAL DETAILS

I had earlier used a similar approach. The second rule I used was:

Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), 'parse_category', follow=True)

# requires: from scrapy.http import Request
def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() returns a list of strings, so take the first match and cast it to int
    count = int(hxs.select('//td[@class="no_of_items"]/text()').extract()[0])
    for num in range(1, count, 15):
        ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
        # yield (not return) so every page is requested, and pass the callback
        # as a callable rather than a string
        yield Request(ajax_url, callback=self.parse_category)

Now I am confused about which callback to use: "parse_category" or "parse_flipkart".

Thank you for your patience


1 Answer

  1. Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose already puts it under your control, doesn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need that. You can parse the items in whatever order they arrive, and gather each item's location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb as a list:

    response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
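
    A minimal sketch of how that could feed into the item callback, assuming a `category` field is added to the item definition (the code in the question does not define one yet):

    @staticmethod
    def parse_flipkart(response):
        item = TutorialItem()
        item['featureKey'] = response.xpath('//td[@class="specsKey"]/text()').extract()
        # 'category' is a hypothetical extra field holding the breadcrumb entries,
        # e.g. ['Home', 'Mobiles & Accessories', 'Mobiles']
        item['category'] = response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
        yield item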
    
  2. You don't actually need Selenium, and I believe it would be overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages, and open the Network tab in the developer window. If there is anything listed there, click the Clear button to clear it out. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents (1) and click on the request in the left pane (2). You can see the URL for the request (3) and the query parameters that you need to send (4). Note the start parameter, which will be the most important one, since you will have to call this request multiple times while increasing this value to get new items. You can check the response in the Preview pane (5), and you will see that the response from the server is exactly what you need: more items. The rule you use for the items should pick up those links too.

    (screenshot of the Chrome Network panel illustrating steps (1)-(5) above)

    For a more detailed overview of scraping with Firebug, you can check out the official documentation.
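
    As a rough sketch of how that pagination could be driven from a spider callback (the `&start=`/`&ajax=true` parameters and the step of 15 come from the request observed above, while the `parse_category_items` name and the product-link XPath are assumptions you should verify against the actual pages):

    from urlparse import urljoin   # Python 2 stdlib, matching the scrapy.contrib-era code in the question
    from scrapy.http import Request

    def parse_category(self, response):
        # total number of items in this category, as printed on the category page
        count = int(response.xpath('//td[@class="no_of_items"]/text()').re(r'\d+')[0])
        for start in range(0, count, 15):
            ajax_url = response.url + "&start=" + str(start) + "&ajax=true"
            yield Request(ajax_url, callback=self.parse_category_items)

    def parse_category_items(self, response):
        # each AJAX response is plain HTML holding the next slice of product links
        for href in response.xpath('//a[contains(@href, "/p/")]/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse_flipkart)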

  3. Since there is no need to use Selenium for your purpose, I shall not cover this point beyond adding a few links that show how to use Selenium with Scrapy, should the need ever arise: