I see a lot of XPath answers but no CSS ones. I have managed to extract all the text I need, but it comes back wrapped in tags, font details, etc. I am pulling a few of the role descriptions off this site.
The code I am using is adapted from the Scrapy tutorial - I want to extract all the job-related text off the site for each role:
    def parse(self, response):
        for href in response.css('.mask-on-hover + a::attr(href)'):
            yield response.follow(href, self.parse_author)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract()

        yield {
            'role': extract_with_css('h1::text'),
            'literature': extract_with_css('h3 span.info::text'),
            'date-posted': extract_with_css('h3 span#ctl00_spListed.info.listed::text'),
            'role-description': extract_with_css('#ctl00_regionContent_lblJobDescription span , strong::text'),
        }
My result for this particular page includes all the text, but also the HTML tags and attributes, including span, style, and font-size.
How do I get clean text in order of appearance on the site using CSS? Ideally I would like to keep the paragraph styles and deliver it to one cell in Excel/CSV ultimately.
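On the single-cell part, a small sketch of what I am aiming for, using only Python's standard `csv` module and hypothetical values. A field that contains line breaks is quoted by the writer, so it stays one field when the file is read back. Whether Scrapy's own CSV feed export behaves the same way is something I have not verified here.

```python
import csv
import io

# Hypothetical cleaned description with one paragraph per line.
description = "We are hiring a data analyst.\nApply by Friday."

# Write one row to an in-memory CSV; the multi-line field is quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["role", "role-description"])
writer.writeheader()
writer.writerow({"role": "Data analyst", "role-description": description})

# Read it back: the line break is still inside a single field of a single row.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(rows))
print(rows[0]["role-description"])
```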
Thank you!