How can I debug Scrapy?

Question

I'm 99% sure something is wrong with my hxs.select on this website: I cannot extract anything. When I run the following code, I get no error feedback, but title and link are never populated. Any help?

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True and a text file opens instead of my web browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

The site isn't loading at all for me. What does response.body look like? — Blender
You can always include print sites and see what is printed during crawling. — alecxe
This site is on our intranet, so you won't have access to it. If I type response.body I do get back the xml-stylesheet (I could not post the whole thing, too many characters): >>> response.body '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n <?xml-stylesheet type="text/xsl" href="/dis/xslt/htmlpage.xslt"?>\n \n <page hide-loading="true">\n <title>Login</title>\n \n <head>\n <link rel="stylesheet" type="text/css" href="/dis/css/login.css"> </link > — Gio
I added print sites to my code, but nothing happened. The only difference I saw in the command prompt was an empty list, []. — Gio

paul trmbrth · Accepted Answer · 2013-07-14T21:51:05

Scrapy shell is indeed a good tool for that. If your document has an XML stylesheet, it is probably an XML document, so in scrapy shell you can use xxs instead of hxs, as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces

When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:

import lxml.etree
import lxml.html
from scrapy.spider import BaseSpider  # import path in the Scrapy 0.x releases of that time

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() already returns the root element, so no .getroot() call is needed
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())

        # then look up which elements can actually be selected;
        # this could be very big, but at least you see all that's inside: the element tags and namespaces
        print(list(root.iter()))
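
To try the element dump outside a spider, something like the following standalone sketch may help. SAMPLE is a hypothetical, closed-up version of the fragment quoted in the comments, and list_tags is an illustrative helper name, not part of lxml or Scrapy. The fallback import to the standard library is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in the answer
except ImportError:
    # Stand-in with the same fromstring()/iter() calls, in case lxml is missing.
    import xml.etree.ElementTree as etree

# Hypothetical sample, modelled on the response.body fragment from the comments.
SAMPLE = b"""<?xml-stylesheet type="text/xsl" href="/dis/xslt/htmlpage.xslt"?>
<page hide-loading="true">
  <title>Login</title>
  <head>
    <link rel="stylesheet" type="text/css" href="/dis/css/login.css"></link>
  </head>
</page>"""


def list_tags(body):
    """Return the tag name of every element, in document order."""
    root = etree.fromstring(body)  # fromstring() returns the root element
    # lxml gives comments and processing instructions a non-string tag; skip those.
    return [el.tag for el in root.iter() if isinstance(el.tag, str)]


tags = list_tags(SAMPLE)
print(tags)
print("div" in tags)
```

Printing only the tag names keeps the output readable on large pages, and it shows at a glance whether the elements an XPath targets (here, div) exist in the raw response at all.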
