I want to scrape the Security section of TheRegister.com and parse its Atom feed (XML) into a data structure.
In the Scrapy shell I've tried:
>>> fetch('https://www.theregister.com/security/headlines.atom')
which results in this response:
2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)
The response has a body that I can view. Here is a snippet (only the first few lines):
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title>The Register - Security</title>
<link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
<link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
<rights>Copyright © 2020, Situation Publishing</rights>
<name>Team Register</name>
<email>[email protected]</email>
<subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
<name>Thomas Claburn</name>
<link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
<title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
<summary type="html" xml:base="https://www.theregister.com/"><h4>Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox</h4> <p>Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about <a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/">220m domains</a>, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…</p> <p><!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --></p></summary>
Why can't I extract any data with the usual XPath methods? I've tried:
>>> response.xpath('entry')
>>> response.xpath('/entry')
>>> response.xpath('//entry')
>>> response.xpath('.//entry')
>>> response.xpath('entry/text()')
>>> response.xpath('/entry/text()')
>>> response.xpath('//entry/text()')
>>> response.xpath('.//entry/text()')
None of these returns anything. I can't extract the other XML tags, such as title, link and author, either.
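A likely cause, judging from the snippet, is the default namespace declared on the feed element (xmlns="http://www.w3.org/2005/Atom"). With that declaration every element belongs to the Atom namespace, so an unqualified XPath name such as entry matches nothing. The sketch below reproduces the effect with the standard library's xml.etree.ElementTree rather than Scrapy, so it runs on its own. The SAMPLE document, its entry and author wrappers, and the example.com link are assumptions made for illustration, not the real response body.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the feed body (not the real response).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <author><name>Thomas Claburn</name></author>
    <title type="html">Example headline</title>
    <link rel="alternate" type="text/html" href="https://example.com/story"/>
  </entry>
</feed>"""

# Prefix-to-URI map; the prefix "atom" is an arbitrary local choice.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(SAMPLE.encode("utf-8"))

# An unqualified name does not match elements in the default namespace.
plain = root.findall(".//entry")

# Qualifying the name with a prefix bound to the Atom URI does match.
qualified = root.findall(".//atom:entry", ATOM)

# Collect each entry into a plain dict.
entries = [
    {
        "title": e.findtext("atom:title", namespaces=ATOM),
        "author": e.findtext("atom:author/atom:name", namespaces=ATOM),
        "link": e.find("atom:link", ATOM).get("href"),
    }
    for e in qualified
]

print(len(plain), len(qualified))
print(entries)
```

If the same cause applies in the Scrapy shell, two approaches should be worth trying: call response.selector.remove_namespaces() and then rerun response.xpath('//entry'), or keep the namespace and query with a prefix, for example response.xpath('//atom:entry', namespaces={'atom': 'http://www.w3.org/2005/Atom'}).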