I am trying to scrape some information from flipkart.com using Scrapy. I need this information for every product on Flipkart.
I have used the following code for my spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from tutorial.items import TutorialItem


    class WebCrawler(CrawlSpider):
        name = "flipkart"
        allowed_domains = ['flipkart.com']
        start_urls = ['http://www.flipkart.com/store-directory']
        rules = [
            # Product pages: scrape them and keep following links.
            Rule(LinkExtractor(allow=[r'/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
            # Category pages: only follow links. The "?" is escaped so it matches literally.
            Rule(LinkExtractor(allow=[r'/(.*?)/pr\?(.*?)']), follow=True)
        ]

        @staticmethod
        def parse_flipkart(response):
            hxs = HtmlXPathSelector(response)
            item = TutorialItem()  # use the item class that is actually imported
            item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
            yield item
My intent is to crawl through every product category page (matched by the second rule) and follow the product pages (matched by the first rule) linked from each category page, scraping data from each product page.
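To make the intended split between the two rules concrete, here is a minimal standalone sketch that classifies URLs with the same two patterns, outside Scrapy. The `classify` helper and the example URLs are hypothetical, made up for illustration. The `?` in the category pattern is escaped on the assumption that it is meant to match a literal question mark (unescaped, `r?` would only make the `r` optional).

```python
import re

# Patterns mirroring the two rules; the "?" is escaped on the assumption
# that category URLs contain a literal "/pr?" (hypothetical URL shape).
PRODUCT_RE = re.compile(r'/(.*?)/p/(.*?)')
CATEGORY_RE = re.compile(r'/(.*?)/pr\?(.*?)')


def classify(url):
    """Return 'product', 'category', or 'other' for a URL (illustrative helper)."""
    # re.search is used because the patterns are meant to match anywhere
    # in the URL, not only at its start.
    if PRODUCT_RE.search(url):
        return 'product'
    if CATEGORY_RE.search(url):
        return 'category'
    return 'other'


if __name__ == '__main__':
    # Made-up URLs, only to exercise the patterns.
    for url in ('http://www.flipkart.com/some-phone/p/itm123',
                'http://www.flipkart.com/mobiles/pr?sid=tyy',
                'http://www.flipkart.com/store-directory'):
        print(url, '->', classify(url))
```

Checking the patterns like this, before wiring them into the rules, should show whether a category URL is accidentally picked up by the product rule or the other way round.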
- First, I cannot find a way to control the crawling and scraping.
- Second, Flipkart uses AJAX on its category pages and loads more products when the user scrolls to the bottom.
- I have read other answers and concluded that Selenium might help solve this, but I cannot find a proper way to integrate it into this structure.
Suggestions are welcome. :)
ADDITIONAL DETAILS
I had earlier tried a similar approach, where the second rule was:

    from scrapy.http import Request

    Rule(LinkExtractor(allow=[r'/(.*?)/pr\?(.*?)']), 'parse_category', follow=True)

    @staticmethod
    def parse_category(response):
        hxs = HtmlXPathSelector(response)
        # extract() returns a list of strings, so take the first match and convert it.
        count = int(hxs.select('//td[@class="no_of_items"]/text()').extract()[0])
        # One AJAX request per batch of 15 products.
        for num in range(1, count, 15):
            ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
            yield Request(ajax_url, callback="parse_category")

Now I am confused about which callback to use here: "parse_category" or "parse_flipkart"?
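For what it is worth, the URL generation inside parse_category can be checked on its own, without Scrapy. Below is a rough sketch. The `start` and `ajax` query parameters and the step of 15 come from my code above, while the `ajax_page_urls` helper name and the sample URL are made up for illustration. I am also assuming the category URL already contains a query string, so that appending with `&` is valid.

```python
def ajax_page_urls(base_url, total_items, page_size=15):
    """Build the paginated AJAX URLs for one category page (illustrative helper)."""
    urls = []
    # Start offsets 1, 16, 31, ... mirror range(1, count, 15) in parse_category.
    for start in range(1, total_items, page_size):
        # str() is needed because start is an int and the URL is a string.
        urls.append(base_url + "&start=" + str(start) + "&ajax=true")
    return urls


if __name__ == '__main__':
    # Hypothetical category URL and item count, only for demonstration.
    for url in ajax_page_urls('http://www.flipkart.com/mobiles/pr?sid=tyy', 40):
        print(url)
```

Each URL produced this way would then be passed to a Request, which is where my question about the right callback comes in.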
Thank you for your patience.