2
votes

I'm 99% sure something is wrong with my hxs.select on this website: I cannot extract anything. When I run the following code I get no error, but title and link are never populated. Any help?

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

Is there a way I can debug this? I also tried the scrapy shell command with a URL, but when I enter view(response) in the shell it simply returns True and a text file opens instead of my web browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'
2
The site isn't loading at all for me. What does response.body look like? - Blender
You can always include print sites and see what is printed during crawling. - alecxe
This site is on our intranet, so you won't have access to it. If I type response.body I do get back the xml-stylesheet (I could not post the whole thing, too many characters): >>> response.body '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n <?xml-stylesheet type="text/xsl" href="/dis/xslt/htmlpage.xslt"?>\n \n <page hide-loading="true">\n <title>Login</title>\n \n <head>\n <link rel="stylesheet" type="text/css" href="/dis/css/login.css"> </link > - Gio
I added print sites to my code, but nothing happened; the only difference I saw in the command prompt was an empty list, []. - Gio

2 Answers

1
votes

Scrapy shell is indeed a good tool for that. If your document has an XML stylesheet, it is probably an XML document, so in the scrapy shell use xxs instead of hxs, as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces
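To see why an HTML-style XPath such as //div[@class='footer'] can come back empty on a page like this, here is a minimal sketch using only the standard library. SAMPLE_BODY is a hypothetical, trimmed-down document modelled on the fragment quoted in the comments, not the real page, and the closing tags are my assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the response.body fragment from the comments.
SAMPLE_BODY = """
<?xml-stylesheet type="text/xsl" href="/dis/xslt/htmlpage.xslt"?>
<page hide-loading="true">
  <title>Login</title>
  <head>
    <link rel="stylesheet" type="text/css" href="/dis/css/login.css"> </link>
  </head>
</page>
"""

# Parse as XML; strip() drops the leading/trailing newlines of the literal.
root = ET.fromstring(SAMPLE_BODY.strip())

# The root element is <page>, not <html>, and the sample has no <div> at all,
# so the query from the question matches nothing.
footer_divs = root.findall(".//div[@class='footer']")

# Elements that do exist can be selected by their real tag names.
title = root.findtext("title")
href = root.find(".//link").get("href")

print(root.tag, footer_divs, title, href)
```

If the real page has the same shape, the selectors need to target the XML tag names (page, title, link, ...) rather than HTML ones.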

When that doesn't work, I tend to fall back to plain lxml.etree and dump all of the document's elements:

import lxml.etree
import lxml.html
from scrapy.spider import BaseSpider

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() already returns the root element, so no .getroot() is needed
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())

        # then look up which elements are actually there to select
        # (this could be very big, but at least you see everything inside: element tags and namespaces)
        print(list(root.iter()))
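The same "dump every element" idea can be tried outside Scrapy with the standard library's xml.etree.ElementTree, which also has an iter() method. This is only a sketch: tag_counts and SAMPLE are hypothetical names and data I made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_counts(xml_text):
    """Return how often each tag occurs; namespaced tags appear as '{uri}name'."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


# Hypothetical document with one namespaced element type.
SAMPLE = '<page xmlns:x="urn:example"><title>Login</title><x:item/><x:item/></page>'

counts = tag_counts(SAMPLE)
for tag, n in sorted(counts.items()):
    print(tag, n)
```

Seeing a tag printed as {urn:example}item is a hint that the XPath has to account for the namespace (or that the namespaces should be removed first).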
1
votes

You can use pdb from the command line and add a breakpoint in your file, though it takes a few steps.

(The steps may differ slightly on Windows.)

  1. Locate your scrapy executable:

    $ which scrapy
    /usr/local/bin/scrapy
    
  2. Run it as a Python script under pdb:

    $ python -m pdb /usr/local/bin/scrapy crawl quotes
    
  3. Once in the debugger shell, open another shell instance and locate the path to your spider script (residing in your spider project)

    $ realpath path/to/your/spider.py
    /absolute/spider/file/path.py
    

This prints the absolute path. Copy it to your clipboard.

  4. In the pdb shell, type:

    b /absolute/spider/file/path.py:line_number
    

...where line_number is the line in that file at which you want to break.

  5. Type c in the debugger to continue; execution stops when the breakpoint is reached.

Now go do some PythonFu :)
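If typing the breakpoint at the pdb prompt is too fiddly, another option is to hard-code pdb.set_trace() at the spot you want to inspect. A minimal sketch, assuming a hypothetical helper extract_links and made-up rows data (neither comes from the question's spider):

```python
import pdb


def extract_links(rows, debug=False):
    """Collect (title, href) pairs; with debug=True, stop in pdb on rows lacking an href."""
    links = []
    for row in rows:
        if debug and not row.get("href"):
            # Opens an interactive pdb prompt here; type c to continue.
            pdb.set_trace()
        links.append((row.get("title"), row.get("href")))
    return links


# Made-up input; debug defaults to False, so this runs straight through.
rows = [
    {"title": "Login", "href": "/dis/login.jsp"},
    {"title": "Help", "href": None},
]
links = extract_links(rows)
print(links)
```

In a spider the same call would go inside the callback, and the prompt appears in the terminal running the crawl. Remember to remove it afterwards.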