1
votes

Extract meta tags from a website using Portia (Scrapy)

I want to use Portia to extract the meta tags from a website, but it does not show the head tag; the element tree starts from the body tag.

As a result, I am only able to extract data from inside the body tag.

2 Answers

7
votes

You need to annotate an element within the body and then navigate to the element in the head that you want to map.

  1. Annotate any element on the page; it doesn't matter which one.
  2. Click the settings icon either in the annotation popup or within the annotations panel on the right-hand toolbox.
  3. Click the html element. You will get a warning that the annotation will lose any attributes already mapped to it; click OK.
  4. Click the settings icon again, and this time select the head element.
  5. Click the settings icon yet again, and you can select child elements within the head.
  6. Once you've selected the element, click the + Field button to create a new field and then map the desired attribute value to the target field.

See also: https://github.com/scrapinghub/portia/issues/60

1
votes

You can use this to get the meta names (assuming hxs is a Scrapy HtmlXPathSelector built from the response):

meta_name = hxs.select('//meta/@name').extract()

This gets the meta contents:

meta_content = hxs.select('//meta/@content').extract()

And this gets the content of a meta tag with a particular name, such as description:

meta = hxs.select("//meta[@name='description']/@content").extract()
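
If you want to check what these selections should return without running a spider, here is a minimal sketch that uses only Python's standard-library html.parser instead of Scrapy selectors. It is not the Scrapy or Portia API; the names MetaTagParser, extract_meta and SAMPLE_HTML are made up for illustration, and the sample page is invented.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        # Skip meta tags without a name attribute, e.g. <meta charset="...">.
        if name is not None:
            self.meta[name] = attr_map.get("content")


def extract_meta(html):
    """Return a dict mapping each meta name to its content value."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta


# Invented sample page, only for demonstration.
SAMPLE_HTML = """
<html>
  <head>
    <meta charset="utf-8">
    <meta name="description" content="Example page">
    <meta name="keywords" content="scrapy, portia">
  </head>
  <body><p>Hello</p></body>
</html>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_HTML))
```

Looking up extract_meta(SAMPLE_HTML)["description"] corresponds to the last XPath expression above. In recent Scrapy versions the same XPath strings can be passed to response.xpath(...) and read with .getall().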