I am using Scrapy to crawl sites, but I am wondering if there is a way to set it to crawl only the blog posts (i.e. not pages) of a website. I could probably create specific rules for each site to get it to work (I've sketched what I mean below the code), but that would be too time-consuming if I needed to crawl multiple sites. Is there a way to have one crawler that works universally across all sites to grab only blog posts? I doubt it, but my fingers are crossed that some genius has an answer to this.
Here is the basic code I have so far, pulled from the Scrapy documentation. What do I need to add to make this work?
from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'crawlit'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse_item(self, response):
        # do something
        pass
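For example, the kind of site-specific rule I mean would look something like this (just a sketch; the /blog/ path and date pattern are guesses at one particular site's permalink structure, nothing universal):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'crawlit'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # Only follow links that look like this site's blog posts,
    # e.g. http://www.example.com/blog/2013/05/some-post/
    # (the allow pattern is specific to this one hypothetical site)
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'/blog/\d{4}/\d{2}/']),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # do something with the blog post page
        pass

Writing a pattern like that for every site is exactly the per-site work I'm hoping to avoid.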
P.S. I thought about just pulling each site's RSS feed, but feeds typically only include the most recent posts -- which means I wouldn't be able to get posts older than a certain date. Unless someone knows a way around that?
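For reference, this is the kind of thing I had in mind for the feed approach (a minimal sketch using the feedparser library; the feed URL is made up):

import feedparser

# hypothetical feed URL -- substitute the blog's real feed
feed = feedparser.parse('http://www.example.com/feed/')
for entry in feed.entries:
    # each entry is one post, but the feed only includes the most
    # recent posts, which is the limitation I described above
    print(entry.link)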