I have a crawler that needs to extract some data from meta tags in the head and from some elements in the body.
When I try this
for courses in response.xpath("//html"):
and this
for courses in response.xpath("//head"):
it only fetches data from the meta tags inside <head>...</head>.
When I try this
for courses in response.xpath("//body"):
it only fetches data from tags inside <body>...</body>.
How do I combine these two selectors? I also tried
for courses in response.xpath("//head | //body"):
but it only returned the 'meta' tags from <head>...</head>; nothing was extracted from the body.
I also tried this
for courses in response.xpath("//*"):
It works, but it is highly inefficient and takes a long time to extract. I am sure there is a more efficient way to do this.
And here is the Scrapy code, if it helps.
The first two fields under yield (pagetype, pagefeatured) are in the <head>...</head> tag. The last two fields (coursetloc, coursetfees) are in the <body>...</body> tag.
And yes, it may look odd, but there are 'meta' tags inside <body>...</body> on the website I am scraping.
import scrapy

class MySpider(scrapy.Spider):  # BaseSpider is the old name of scrapy.Spider
    name = "dkcourses"
    start_urls = ['http://www.example.com/scrapy/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # Every field below is looked up relative to the node matched here.
        for courses in response.xpath("//body"):
            yield {
                'pagetype': ''.join(courses.xpath('.//meta[@name="dkpagetype"]/@content').extract()),
                'pagefeatured': ''.join(courses.xpath('.//meta[@name="dkpagefeatured"]/@content').extract()),
                'coursetloc': ''.join(courses.xpath('.//meta[@name="dkcoursetloc"]/@content').extract()),
                'coursetfees': ''.join(courses.xpath('.//meta[@name="dkcoursetfees"]/@content').extract()),
            }
        # Follow the links found in the listing and parse them the same way.
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
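To make the goal concrete, here is a minimal standalone sketch of the result I am after. It is only an illustration under a few assumptions: the HTML and its values are made up, the helper names (meta_content, extract) are hypothetical, and it uses Python's xml.etree.ElementTree instead of Scrapy selectors so that it runs on its own. It may therefore not reproduce exactly what Scrapy does on the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page (all values are made up).
PAGE = """
<html>
  <head>
    <meta name="dkpagetype" content="course" />
    <meta name="dkpagefeatured" content="yes" />
  </head>
  <body>
    <meta name="dkcoursetloc" content="London" />
    <meta name="dkcoursetfees" content="100" />
  </body>
</html>
"""

# Output field -> value of the meta tag's name attribute.
FIELDS = {
    "pagetype": "dkpagetype",
    "pagefeatured": "dkpagefeatured",
    "coursetloc": "dkcoursetloc",
    "coursetfees": "dkcoursetfees",
}


def meta_content(node, name):
    """Join the content attributes of all matching <meta> tags below node."""
    matches = node.findall(f".//meta[@name='{name}']")
    return "".join(m.get("content", "") for m in matches)


def extract(node):
    """Build one record by searching only inside the given node."""
    return {key: meta_content(node, name) for key, name in FIELDS.items()}


root = ET.fromstring(PAGE.strip())

# Looping over <head> and <body> separately gives two partial records,
# each with empty strings for the fields that live in the other section.
per_section = [extract(section) for section in root]

# Searching once from the document root gives a single complete record,
# which is what I want the spider to yield.
whole_page = extract(root)

print(per_section)
print(whole_page)
```

In this sketch, per_section holds one record with only the head fields filled in and one with only the body fields filled in, while whole_page has all four fields in a single record.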
Any help is much appreciated. Thanks.