Using CSS with Scrapy to extract all text without tags - failing

Question

I see a lot of XPath answers but no CSS ones. I have managed to extract all the text I need, but it comes back wrapped in tags, font details, etc. I am pulling a few of the role descriptions off this site.

The code I am using is adapted from the Scrapy tutorial - I want to extract all the job-related text off the site for each role:

def parse(self, response):
    for href in response.css('.mask-on-hover + a::attr(href)'):
        yield response.follow(href, self.parse_author)

def parse_author(self, response):
    def extract_with_css(query):
        return response.css(query).extract()

    yield {
        'role': extract_with_css('h1::text'),
        'literature': extract_with_css('h3 span.info::text'),
        'date-posted': extract_with_css('h3 span#ctl00_spListed.info.listed::text'),
        'role-description': extract_with_css('#ctl00_regionContent_lblJobDescription span , strong::text'),
    }

My result for this particular page includes all the text, but also the HTML tags and attributes, including span, style, and font-size.

How do I get clean text, in order of appearance on the site, using CSS? Ideally I would like to keep the paragraph styles and ultimately deliver it to one cell in Excel/CSV.

Thank you!

Wilfredo Wilfredo · Accepted Answer · 2017-10-31T18:41:42

If the CSS selectors already match exactly what you want, you could strip the markup with the remove_tags function from w3lib. I don't think that is necessary in your case, though. Please try this:

'role-description': extract_with_css('#ctl00_regionContent_lblJobDescription span *::text')
