I want to scrape the Security section of TheRegister.com and parse its Atom (XML) feed into a data structure.
In the Scrapy Shell I've tried:
>>> fetch('https://www.theregister.com/security/headlines.atom')
which results in this response:
2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)
The response body can be viewed. Here is a snippet (only the first few lines):
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>tag:theregister.com,2005:feed/theregister.com/security/</id>
<title>The Register - Security</title>
<link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
<link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
<rights>Copyright © 2020, Situation Publishing</rights>
<author>
<name>Team Register</name>
<email>[email protected]</email>
<uri>https://www.theregister.com/odds/about/contact/</uri>
</author>
<icon>https://www.theregister.com/Design/graphics/icons/favicon.png</icon>
<subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
<logo>https://www.theregister.com/Design/graphics/Reg_default/The_Register_r.png</logo>
<updated>2020-11-06T23:58:13Z</updated>
<entry>
<id>tag:theregister.com,2005:story211912</id>
<updated>2020-11-06T23:58:13Z</updated>
<author>
<name>Thomas Claburn</name>
<uri>https://search.theregister.com/?author=Thomas%20Claburn</uri>
</author>
<link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
<title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
<summary type="html" xml:base="https://www.theregister.com/"><h4>Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox</h4> <p>Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about <a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/">220m domains</a>, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…</p> <p><!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --></p></summary>
</entry>
Why can't I extract any data with regular XPath expressions? I've tried:
>>> response.xpath('entry')
[]
>>> response.xpath('/entry')
[]
>>> response.xpath('//entry')
[]
>>> response.xpath('.//entry')
[]
>>> response.xpath('entry/text()')
[]
>>> response.xpath('/entry/text()')
[]
>>> response.xpath('//entry/text()')
[]
>>> response.xpath('.//entry/text()')
[]
All of these return an empty list. I cannot extract the other XML tags either, such as title, link, and author.
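One possible explanation, offered as an assumption rather than a confirmed diagnosis: the `<feed>` element declares a default namespace (`xmlns="http://www.w3.org/2005/Atom"`), so every element belongs to that namespace and an unqualified name such as `entry` matches nothing. The sketch below reproduces the effect outside Scrapy with the standard library's `xml.etree.ElementTree`. The `SAMPLE` string is a trimmed stand-in for the body above, and `parse_entries` and the `atom` prefix are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the <feed> element.
ATOM_NS = "http://www.w3.org/2005/Atom"

# Trimmed stand-in for the response body (assumption: not the full feed).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <updated>2020-11-06T23:58:13Z</updated>
    <author><name>Thomas Claburn</name></author>
    <link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
    <title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return one dict per <entry>, using namespace-qualified lookups."""
    root = ET.fromstring(xml_text)
    ns = {"atom": ATOM_NS}  # the prefix is arbitrary; only the URI matters
    entries = []
    for entry in root.findall("atom:entry", ns):
        entries.append({
            "id": entry.findtext("atom:id", namespaces=ns),
            "updated": entry.findtext("atom:updated", namespaces=ns),
            "author": entry.findtext("atom:author/atom:name", namespaces=ns),
            "link": entry.find("atom:link", ns).get("href"),
            "title": entry.findtext("atom:title", namespaces=ns),
        })
    return entries


# An unqualified lookup finds nothing, mirroring the empty lists above.
print(ET.fromstring(SAMPLE).findall("entry"))

# A namespace-qualified lookup finds the entry.
print(parse_entries(SAMPLE))
```

If the same cause applies in the Scrapy shell, the equivalent there would be either calling `response.selector.remove_namespaces()` and then `response.xpath('//entry')`, or calling `response.selector.register_namespace('atom', 'http://www.w3.org/2005/Atom')` and then `response.xpath('//atom:entry')`.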