
I have a Scrapy CrawlSpider that extracts links from an image map using the SgmlLinkExtractor in a rule like this:

Rule(SgmlLinkExtractor(allow_domains=('pressen-haas.de',),
                       restrict_xpaths=('//map[@name="bildmaschinen"]',)))
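
For context, the rule sits inside a CrawlSpider roughly like this; the class and callback names are placeholders rather than my actual spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MachinesSpider(CrawlSpider):
    name = 'pressen-haas'
    allowed_domains = ['pressen-haas.de']
    start_urls = ['http://www.pressen-haas.de/neu//machines.php?lang=en']

    rules = (
        # follow only the links inside the image map
        Rule(SgmlLinkExtractor(allow_domains=('pressen-haas.de',),
                               restrict_xpaths=('//map[@name="bildmaschinen"]',)),
             callback='parse_machine'),
    )

    def parse_machine(self, response):
        pass  # test callback shown below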

The start URL is http://www.pressen-haas.de/neu//machines.php?lang=en if you want to have a look. The resulting URLs have the form http://www.pressen-haas.de/neu//masch_cat.php?lang=en&phid=0, where the phid parameter runs from 0 to 8. All fine so far: the spider finds the nine different URLs and crawls them. The problem is that when the spider fetches these pages, the information you see when you enter the URL in a browser is not there. I wrote a callback to test it, which does

from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
print hxs.select('//text()').extract()

to see what is there, and the results are exactly what you see if you type the URL into a browser and remove the second URL parameter, i.e. http://www.pressen-haas.de/neu/masch_cat.php?lang=en

I have checked that the spider is crawling the correct URLs; I can copy the crawled URLs from the spider's output log into a browser and they work fine. Why can I see these URLs in a browser while the spider sees something different?
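
One way to compare the two directly is to dump the raw body the spider receives and diff it against "View Source" in the browser for the same URL; a sketch along these lines (the phid-based filename is just illustrative):

def parse_machine(self, response):
    # write out exactly what the spider received so it can be diffed against
    # what the browser shows for the same URL
    phid = response.url.split('phid=')[-1]
    with open('masch_cat_%s.html' % phid, 'wb') as f:
        f.write(response.body)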

Thanks in advance.

Can you post your full spider code? – Steven Almeroth

1 Answer


The HTML for the pages the spider was attempting to scrape was very badly formed, and I am fairly sure that this, rather than an issue with the spider itself, is the problem.
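
A rough way to confirm that, assuming the raw body has been saved to a file as in the callback sketch above, is to push the page through a lenient HTML parser such as lxml and see whether the content you see in the browser survives the repair:

import lxml.html

raw = open('masch_cat_0.html').read()  # placeholder filename, a body saved by the spider
tree = lxml.html.fromstring(raw)       # lxml repairs the broken markup as best it can
print lxml.html.tostring(tree, pretty_print=True)
# if the text visible in the browser is missing here as well, the badly formed
# markup is being lost at parse time rather than at crawl time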