1
votes

There is a webpage that my browser can access, but both urllib2.urlopen() (Python) and wget return HTTP 403 (Forbidden). Is there a way to figure out what happened?

I am using the most primitive form, like urllib2.urlopen("http://test.com/test.php"), with the same URL (http://test.com/test.php) in both the browser and wget. I cleared all cookies in the browser before the test.

Thanks a lot!
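To see what the server actually said when it refused the request, you can catch the HTTPError that urlopen raises. A minimal sketch (shown with Python 3's urllib.request; the same idea applies to urllib2 in Python 2, and the URL is just the placeholder from the question):

```python
from urllib.request import urlopen
from urllib.error import HTTPError

def fetch_verbose(url):
    """Fetch url, printing the server's status and headers on failure."""
    try:
        return urlopen(url).read()
    except HTTPError as e:
        # HTTPError carries the response the server sent back:
        # e.code is the status (e.g. 403), e.headers the response
        # headers, and e.read() the error page body.
        print(e.code, e.reason)
        print(dict(e.headers))
        raise
```

The 403 body and headers often reveal whether the block is a user-agent check, a missing cookie, or something server-side.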

I tried it and it prints success. What system and what version of Python are you running? – user1786283
The site might be hindering screen scraping. See http://test.com/robots.txt. Try changing the User-Agent header. – jfs
@enginefree I don't think the OP meant http://test.com/test.php literally. – Nathan
@J.F.Sebastian What else could they do to hinder screen scraping? I have made the headers exactly the same as what I saw in LiveHTTPHeaders. – CuriousMind
Does it work if you turn off JavaScript, Flash, and images in the browser? – jfs

2 Answers

2
votes

The Python library urllib has a default user-agent string that includes the word "Python", and wget uses "wget/VERSION". If the site you are connecting to checks the user-agent, it will probably reject both. Google, for instance, does so.

It's easy enough to fix: for wget, use the -U parameter, and for urllib, create a URLOpener with an appropriate string.
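As a sketch of the urllib side (written here against Python 3's urllib.request; the same idea applies to urllib2.Request in Python 2, and the Mozilla string is just an illustrative placeholder, not a recommendation):

```python
from urllib.request import Request, urlopen

# Replace the default "Python-urllib/x.y" user-agent with a
# browser-like string before opening the URL.
req = Request(
    "http://test.com/test.php",  # the question's placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
# body = urlopen(req).read()  # actual network call, left commented out
print(req.get_header("User-agent"))
```

The wget equivalent is the -U flag, e.g. `wget -U "Mozilla/5.0 (X11; Linux x86_64)" http://test.com/test.php`.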

0
votes

Some sites don't allow web scraping. Try using the Python requests library.

This library should work.