
I found that a page I'm crawling is slow, while visiting it through GoAgent is relatively fast, so I ran this before starting my spider:

export http_proxy=http://192.168.1.102:8087

Yet when I start the spider, it reports this error:

[<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]

To validate the proxy, I ran this curl command:

curl -I  -x 192.168.1.102:8087 http://www.blabla.com/target/page.php

and the output headers look normal to me:

HTTP/1.1 200
Content-Length: 0
Via: HTTP/1.1 GWA
Content-Encoding: gzip
X-Powered-By: PHP/5.3.3
Vary: Accept-Encoding
Server: Apache/2.2.15 (CentOS)
Connection: close
Date: Sun, 30 Mar 2014 16:49:29 GMT
Content-Type: text/html

I tried adding this to Scrapy's settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 100,
}

Still no luck. Is this a problem with Scrapy, or am I missing something else?

My Scrapy version is 0.22.2.


1 Answer


You could try setting both http_proxy and https_proxy:

export http_proxy=http://192.168.1.102:8087
export https_proxy=http://192.168.1.102:8087
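
If the environment variables still aren't picked up, another option is to set the proxy explicitly on each request via request.meta['proxy'], which the downloader honors. A minimal sketch (the spider name, start URL, and parse callback are placeholders; the proxy address is the one from your question):

from scrapy.http import Request
from scrapy.spider import Spider

class ProxiedSpider(Spider):
    name = 'blabla'
    start_urls = ['http://www.blabla.com/target/page.php']

    def start_requests(self):
        for url in self.start_urls:
            # Attach the proxy to each request explicitly instead of
            # relying on the http_proxy/https_proxy environment variables.
            yield Request(url, meta={'proxy': 'http://192.168.1.102:8087'})

    def parse(self, response):
        self.log('Fetched %s' % response.url)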

Also, I guess your Twisted version is 15.0.0; that version has a problem with HTTPS through a proxy.
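
To confirm which Twisted version Scrapy is actually using, a quick check (nothing Scrapy-specific assumed here) is:

import twisted
print(twisted.version.short())  # prints the installed Twisted version, e.g. '14.0.0'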