I have a Scrapy CrawlSpider that extracts links from an image map using SgmlLinkExtractor in a rule like this:
Rule(SgmlLinkExtractor(allow_domains=('pressen-haas.de',),
                       restrict_xpaths=('//map[@name="bildmaschinen"]',)))
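For reference, the rule sits inside a CrawlSpider along these lines (a sketch only; the class name, spider name and callback name are placeholders, not my exact code):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MachinesSpider(CrawlSpider):
    name = 'pressen-haas'
    allowed_domains = ['pressen-haas.de']
    start_urls = ['http://www.pressen-haas.de/neu//machines.php?lang=en']

    rules = (
        # follow only the links inside the image map and hand them to the callback
        Rule(SgmlLinkExtractor(allow_domains=('pressen-haas.de',),
                               restrict_xpaths=('//map[@name="bildmaschinen"]',)),
             callback='parse_machine'),
    )

    def parse_machine(self, response):
        # test callback, shown again further down
        hxs = HtmlXPathSelector(response)
        print hxs.select('//text()').extract()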
The start URL is http://www.pressen-haas.de/neu//machines.php?lang=en if you want to have a look. The resulting URLs are of the form http://www.pressen-haas.de/neu//masch_cat.php?lang=en&phid=0, where the phid parameter runs from 0 to 8. So far so good: the spider finds the 9 different URLs and crawls them. The problem is that when the spider fetches these pages, the information that is there when you enter the same URL in a browser is not available. I wrote a callback to test this, which does:
hxs = HtmlXPathSelector(response)
print hxs.select('//text()').extract()
to see what is there, and the result is exactly what you see if you type the URL into a browser and remove the second query parameter, i.e. http://www.pressen-haas.de/neu/masch_cat.php?lang=en.
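To rule out a mismatch between the URL I think is being requested and the one actually fetched, the same test callback can also print response.url (a sketch; parse_machine is just the placeholder name from above, and response.url is the final URL after any redirects):

from scrapy.selector import HtmlXPathSelector

def parse_machine(self, response):
    # print the URL Scrapy actually ended up at, then the page text
    print response.url
    hxs = HtmlXPathSelector(response)
    print hxs.select('//text()').extract()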
I have checked that the spider is crawling the correct URLs: I can copy the crawled URLs from the spider's output log into a browser and they work fine. So why can I see these pages in a browser while the spider sees something different?
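For completeness, one more check I can run outside the spider is to fetch one of the crawled URLs with the Scrapy shell and run the same XPath there (assuming the shell's default hxs and response objects are available in this Scrapy version):

$ scrapy shell 'http://www.pressen-haas.de/neu//masch_cat.php?lang=en&phid=0'
>>> response.url                            # check whether the phid parameter survived
>>> print hxs.select('//text()').extract()  # same test as in the spider callback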
Thanks in advance.