
I am using Scrapy to crawl sites, but I am wondering if there is a way to set it to crawl only the blog posts of a website (i.e. not static pages)? I could probably create specific rules for each site to get it to work, but that would be too time-consuming if I needed to crawl multiple sites. Is there a way to have one crawler that works universally across all sites and grabs only blog posts? I doubt it, but my fingers are crossed that some genius has an answer to this.

Here is the basic code I have so far, pulled from the Scrapy documentation. What do I need to add to make this work?

from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'crawlit'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse_item(self, response):
        # do something
        pass

P.S. I thought about just pulling the RSS feed, but RSS feeds only show recent posts, which means I wouldn't be able to get posts older than a certain date. Unless someone knows a way around that?


1 Answer


You could use a library like python-readability to extract the main article text from a given URL and use that to qualify the page as a "blog post":

from readability.readability import Document

def parse_url(self, response):
    html = response.body
    doc = Document(html)
    readable_article = doc.summary()      # cleaned HTML of the main article body
    readable_title = doc.short_title()    # page title with site-name cruft stripped
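One way to turn that output into a "is this a blog post?" decision is a simple length heuristic: if readability cannot recover a reasonably long article body, the page is probably an index, category, or other non-post page. The helper name `looks_like_blog_post` and the `MIN_ARTICLE_CHARS` threshold below are just illustrative assumptions, not part of python-readability:

from readability.readability import Document

MIN_ARTICLE_CHARS = 500  # assumed threshold; tune per site

def looks_like_blog_post(html):
    """Return (is_post, title, article_html) based on how much
    article text readability can recover from the page."""
    doc = Document(html)
    article_html = doc.summary()
    is_post = len(article_html) >= MIN_ARTICLE_CHARS
    return is_post, doc.short_title(), article_html

Inside your spider callback you could call looks_like_blog_post(response.body) and only yield an item when the first value is True.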

If you only want articles from a website, it might also be worth checking whether the site has an RSS feed.
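If a feed does exist, a library like feedparser can enumerate whatever entries it currently exposes. This is only a sketch: the feed URL is a placeholder and the available fields depend on the feed itself.

import feedparser

feed = feedparser.parse('http://www.example.com/feed')  # placeholder feed URL

for entry in feed.entries:
    # entry.link could be fed back to Scrapy as a start URL for a full crawl of that post
    print(entry.title, entry.link)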