
I have a crawler which needs to extract some data from meta tags in the head and from some element tags in the body.

When I try this

for courses in response.xpath("//html"):

and this

for courses in response.xpath("//head"):

it only fetches data from the meta tags inside <head>...</head>.

When I try this

for courses in response.xpath("//body"):

it only fetches data from tags inside the HTML <body>...</body>.

How do I combine these two selectors? I also tried

for courses in response.xpath("//head | //body"):

but it only returned 'meta' tags from <head>...</head>; nothing was extracted from the body.

I also tried this

for courses in response.xpath("//*"):

It works, but it is highly inefficient and takes a long time to extract. I am sure there is a more efficient way to do this.

Here is the Scrapy code, if it helps:

The first two fields under yield (pagetype, pagefeatured) are in the <head>...</head> tag. The last two fields (coursetloc, coursetfees) are in the <body>...</body> tag.

It may look odd, but the website I am scraping does have 'meta' tags inside <body>...</body>.

from scrapy import Request, Spider

class MySpider(Spider):
    name = "dkcourses"
    start_urls = ['http://www.example.com/scrapy/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        # One item per matched node; the queries below are relative to that node.
        for courses in response.xpath("//body"):
            yield {
                'pagetype': ''.join(courses.xpath('.//meta[@name="dkpagetype"]/@content').extract()),
                'pagefeatured': ''.join(courses.xpath('.//meta[@name="dkpagefeatured"]/@content').extract()),
                'coursetloc': ''.join(courses.xpath('.//meta[@name="dkcoursetloc"]/@content').extract()),
                'coursetfees': ''.join(courses.xpath('.//meta[@name="dkcoursetfees"]/@content').extract()),
            }
        # Follow every link in the course listing and parse it the same way.
        for url in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)

Any help is much appreciated. Thanks.

Post the URL or the HTML code. - 宏杰李
@宏杰李 Posted the code ... - Slyper
I mean the website URL. - 宏杰李

1 Answer

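  1. Use extract_first() to get the first matching value instead of calling join() on the list returned by extract().
  2. Use the predicate [starts-with(@name, "dkn")] to select the meta tags. A path that starts with //meta searches the whole document, so it matches tags in both <head> and <body>.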

In [5]: for meta in response.xpath('//meta[starts-with(@name, "dkn")]'):
   ...:     name = meta.xpath('@name').extract_first()
   ...:     content = meta.xpath('@content').extract_first()
   ...:     print({name:content})

out:

{'dknpagetype': 'Course'}
{'dknpagefeatured': ''}
{'dknpagedate': '2016-01-01'}
{'dknpagebanner': 'http://www.deakin.edu.au/__data/assets/image/0006/757986/Banner_Cyber-Alt2.jpg'}
{'dknpagethumbsquare': 'http://www.deakin.edu.au/__data/assets/image/0009/757989/SQ_Cyber1-2.jpg'}
{'dknpagethumblandscape': 'http://www.deakin.edu.au/__data/assets/image/0007/757987/LS_Cyber1-1.jpg'}
{'dknpagethumbportrait': 'http://www.deakin.edu.au/__data/assets/image/0008/757988/PT_Cyber1-3.jpg'}
{'dknpagetitle': 'Graduate Diploma of Cyber Security'}
{'dknpageurl': 'http://www.deakin.edu.au/course/graduate-diploma-cyber-security'}
{'dknpagedescription': "Take your understanding of cyber security to the next level with Deakin's Graduate Diploma of Cyber Security and build your capacity to investigate and combat cyber-crime."}
{'dknpageid': '723503'}