
The task is simple: I want to fetch an HTML file from a URL into a variable and read it, following the approaches in: How can I read the contents of an URL with Python?

All that works well except with the url = "https://www.goyax.de/"

with

import urllib.request
#fp = urllib.request.urlopen("https://www.spiegel.de/")
fp = urllib.request.urlopen("https://www.goyax.de/")
print("Result code: " + str(fp.getcode()))
print("Returned data: -----------------")
data = fp.read().decode("utf-8")
print(data)

I get only "403" and "Forbidden". Also with

import requests
url = 'https://www.goyax.de/'
#url = 'https://www.spiegel.de'
r = requests.get(url)
tt = r.text
print(tt)

I don't get an improvement. With other URLs both solutions work well so far.

Until now I have been using an AutoHotkey script (UrlDownloadToFile, Windows only), and I also tried Octave (s = urlread("https://www.goyax.de/")); both return the right result with no error message. The scripts have been running for years on a PC, but I want to move this task to a Raspberry Pi, which is why I am learning Python.

The output / error messages:

fp = urllib.request.urlopen("http://www.goyax.de/")

  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: Forbidden


1 Answer


Well, I found the answer myself, with some help from a friend: the key to the solution is to set the user agent.

Solution 1) (with "Requests")

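import requests

url2 = "https://www.goyax.de/"  # the URL from the question
# Send a browser-like User-Agent; this site answered 403 without one
r = requests.get(url2, headers={"User-Agent": "Mozilla/5.0"}, timeout=25)
mystr = r.text  # decoded text; use r.content for the raw bytes
print(mystr)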

Solution 2) (with "urllib.request" + "CookieJar")

import urllib.request
from http.cookiejar import CookieJar

url2 = "https://www.goyax.de/"  # the URL from the question
# Pass the User-Agent header instead of using a bare urllib.request.Request(url2)
req = urllib.request.Request(url2, None, {"User-Agent": "Mozilla/5.0"})
# Build an opener that keeps cookies between requests, including redirects
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar()))
response = opener.open(req)
content = response.read()
mystr2 = content.decode("utf-8")
print(mystr2)

Usually the user agent 'Mozilla/5.0' is sufficient. If it is not, see https://manytools.org/http-html-text/user-agent-string/ and https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/ for a real user agent string.
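
If you want to reuse this for several URLs, the idea can be wrapped in a small helper. This is only a minimal sketch using the standard library: the names build_request, fetch_text and DEFAULT_USER_AGENT are my own (hypothetical, not part of urllib), and the timeout and encoding defaults are assumptions carried over from the solutions above. It returns the status code instead of raising, so a 403 is easy to spot.

```python
import urllib.request
from urllib.error import HTTPError

# Assumption: the same generic value used in the solutions above.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, user_agent=DEFAULT_USER_AGENT, timeout=25, encoding="utf-8"):
    """Fetch url and return (status_code, text); text is None on an HTTP error."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.getcode(), response.read().decode(encoding)
    except HTTPError as err:
        # err.code holds the HTTP status, e.g. 403 when the server rejects the client.
        return err.code, None
```

It would be called as status, text = fetch_text("https://www.goyax.de/"). If the status is still 403, try one of the real user agent strings from the links above as the second argument.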