
I want to scrape TheRegister.com's Security section and parse the XML into a data structure.

In the Scrapy Shell I've tried:

>>> fetch('https://www.theregister.com/security/headlines.atom')

which results in the response:

2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)

The response body can be viewed; a snippet is shown below (only the first few lines):

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>tag:theregister.com,2005:feed/theregister.com/security/</id>
  <title>The Register - Security</title>
  <link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
  <link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
  <rights>Copyright © 2020, Situation Publishing</rights>
  <author>
    <name>Team Register</name>
    <email>[email protected]</email>
    <uri>https://www.theregister.com/odds/about/contact/</uri>
  </author>
  <icon>https://www.theregister.com/Design/graphics/icons/favicon.png</icon>
  <subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
  <logo>https://www.theregister.com/Design/graphics/Reg_default/The_Register_r.png</logo>
  <updated>2020-11-06T23:58:13Z</updated>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <updated>2020-11-06T23:58:13Z</updated>
    <author>
      <name>Thomas Claburn</name>
      <uri>https://search.theregister.com/?author=Thomas%20Claburn</uri>
    </author>
    <link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
    <title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
    <summary type="html" xml:base="https://www.theregister.com/">&lt;h4&gt;Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox&lt;/h4&gt; &lt;p&gt;Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about &lt;a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/"&gt;220m domains&lt;/a&gt;, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…&lt;/p&gt; &lt;p&gt;&lt;!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --&gt;&lt;/p&gt;</summary>
  </entry>

Why can I not parse any data with regular XPath selectors? I've tried:

>>> response.xpath('entry')
[]
>>> response.xpath('/entry')
[]
>>> response.xpath('//entry')
[]
>>> response.xpath('.//entry')
[]
>>> response.xpath('entry/text()')
[]
>>> response.xpath('/entry/text()')
[]
>>> response.xpath('//entry/text()')
[]
>>> response.xpath('.//entry/text()')
[]

All with no luck. I also cannot extract the other XML tags, like title, link, and author.


1 Answer


TL;DR: call response.selector.remove_namespaces() before running response.xpath().

This strips the xmlns="http://www.w3.org/2005/Atom" declaration from the response, so your elements are no longer in a namespace and plain XPath like //entry works. Alternatively, you can register the namespace and include its prefix in your selectors:

response.selector.register_namespace('n', 'http://www.w3.org/2005/Atom')
response.xpath('//n:entry')

You can read more details here.
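To see why the plain XPath queries return nothing, here is a minimal sketch using only Python's standard-library xml.etree.ElementTree (not Scrapy) against a trimmed-down Atom fragment modeled on the feed above. The default xmlns puts every element into the Atom namespace, so an unqualified lookup finds nothing; binding a prefix to the namespace (the ElementTree analogue of register_namespace('n', ...) plus '//n:entry') makes the same query succeed and lets you build the data structure the question asks for:

```python
import xml.etree.ElementTree as ET

# Minimal Atom fragment with the same default namespace the feed declares.
# The id/title values are copied from the snippet in the question.
ATOM = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <title type="html">Example headline</title>
  </entry>
</feed>"""

root = ET.fromstring(ATOM)

# Without the namespace, the element is not found -- the same effect as the
# empty results from response.xpath('//entry') in the Scrapy shell.
assert root.findall('entry') == []

# Bind a prefix to the Atom namespace and qualify the query with it.
NS = {'a': 'http://www.w3.org/2005/Atom'}
entries = [
    {'id': e.findtext('a:id', namespaces=NS),
     'title': e.findtext('a:title', namespaces=NS)}
    for e in root.findall('a:entry', NS)
]
print(entries)
```

Scrapy's remove_namespaces() sidesteps all of this by deleting the namespace declaration up front, which is more convenient when you don't care about namespace collisions in the document.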