I am trying to verify whether an online radio URL is delivering music and whether the URL was redirected (this happens if, for some reason, the requested URL is wrong or not active). I found some advice here: Fetching url in python with google app engine. However, for a URL that delivers Content-Type: audio/mpeg it doesn't seem to work.

On my local machine, using Python 2.7.6 and urllib2.urlopen, everything is fine:

import urllib2

try:
    print "begin urlopen"
    # urlopen returns as soon as the response headers have been received
    url = urllib2.urlopen("http://streaming.radionomy.com/jamaican-roots-radio")
    print "end urlopen"
except Exception, e:
    print e

gives

begin urlopen

end urlopen

I can then read N bytes from the returned object (an addinfourl wrapping a socket._fileobject) and use the geturl() method to get the actual URL the stream is coming from (if there was no redirection, the requested URL and the retrieved resource URL are the same).
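The local check described above can be sketched roughly as follows. This is only an illustration, not the code actually used: the function name probe_stream, the result keys and the opener parameter (a hook for substituting a fake urlopen) are assumptions made for the example, and the import fallback is there only so that the sketch also runs on Python 3.

```python
try:
    from urllib2 import urlopen          # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def probe_stream(url, nbytes=1024, timeout=5, opener=urlopen):
    """Open url, read at most nbytes, and report where the data came from.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is the standard urlopen.
    """
    response = opener(url, timeout=timeout)
    try:
        final_url = response.geturl()    # URL after any redirects
        content_type = response.info().get("Content-Type", "")
        sample = response.read(nbytes)   # bounded read; a bare read() would never end
    finally:
        response.close()                 # hang up on the endless stream
    return {
        "final_url": final_url,
        "redirected": final_url != url,
        "is_audio": content_type.lower().startswith("audio/"),
        "bytes_read": len(sample),
    }
```

Calling probe_stream with the radio URL on a local machine would be expected to return quickly, because only nbytes are read before the connection is closed.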

The problems arise when using dev_appserver.py for Google App Engine (I haven't deployed yet). The call never returns:

begin urlopen

WARNING 2015-06-12 14:31:43,599 urlfetch_stub.py:504] Stripped prohibited headers from URLFetch request: ['Host']

and "end urlopen" is never printed.

I understand the warning, so I switched (as suggested in the link above) to urlfetch:

from google.appengine.api import urlfetch

try:
    print "begin fetch"
    url = urlfetch.fetch("http://streaming.radionomy.com/jamaican-roots-radio")
    print "end fetch"
except Exception, e:
    print e

gives

begin fetch

The warning is gone, but again the call doesn't return.

For a normal web page URL, everything works as expected. I guess the problem is that the response never finishes. Also using

urlfetch.set_default_fetch_deadline(5)

doesn't change the situation, probably because the data is continuously streamed from the server (and therefore no timeout is triggered?). I also tried the low-level httplib.HTTPConnection, but after making the request, getresponse() never returns.

For my purpose, the response headers would be enough. But on the server (which is not under my control) the HEAD method is not implemented (despite being listed in Access-Control-Allow-Methods, as can be seen from a browser):

curl -X HEAD -i http://streaming.radionomy.com/jamaican-roots-radio

HTTP/1.0 501 Not Implemented

I didn't find any question on Stack Overflow covering the case of a stream URL except this one: How to call Twitter's Streaming/Filter Feed with urllib2/httplib?. Unfortunately, the suggested answer is not very helpful for me ("Using Twitter's 'standard' API").

Any idea how I can solve this problem?

UPDATE

On google appengine (not on dev_appserver.py as above) the problems are similar:

  • with a deadline of 5 sec

Deadline exceeded while waiting for HTTP response from URL...

  • with a deadline of 60 sec

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
    result = handler(dict(self._environ), self._StartResponse)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~radiosnoozers/3.384985169499124712/controllers/checkurl.py", line 80, in get
    print e
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/request_environment.py", line 94, in write
    self._request.errors.write(data)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/logservice/logservice.py", line 287, in write
    self._write(line)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/logservice/logservice.py", line 307, in _write
    if self._request != logsutil.RequestID():
DeadlineExceededError

The timeout is respected, and there is no difference when using allow_truncated=True. In any case, there is no access to the response...

I really don't know what is going on, but thanks for the given suggestions.

2
Interesting. I guess there's not much you can do. URLFetch is an API to Google's HTTP request service infrastructure rather than a mere library. How about specifying allow_truncated=True? The request should finish after receiving 32 MB of data. I know it's wasteful, though. - Kenji Noguchi
Well, it doesn't work either! Moreover, that would take about 30 min (at a bitrate of 128 kbps), which is really not OK for an hourly cron job on App Engine. - Clod
Use a Managed VM, or run the check on EC2, and just query that from App Engine. - Tim Hoffman
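The half-hour estimate in the comments can be checked with a quick calculation. The sketch below assumes that 32 MB means 32 * 2**20 bytes, that 128 kbps means 128,000 bits per second, and that the stream runs at a constant bitrate; the function name is made up for the example.

```python
def seconds_to_fill(limit_bytes, bitrate_bits_per_sec):
    """Time for a constant-bitrate stream to deliver limit_bytes."""
    return limit_bytes * 8.0 / bitrate_bits_per_sec


TRUNCATION_LIMIT_BYTES = 32 * 1024 * 1024   # 32 MB, assuming binary megabytes
BITRATE_BITS_PER_SEC = 128 * 1000           # 128 kbps, as quoted in the comment

minutes = seconds_to_fill(TRUNCATION_LIMIT_BYTES, BITRATE_BITS_PER_SEC) / 60.0
```

Under these assumptions the result is roughly 35 minutes, in line with the "about 30 min" figure and far beyond the 5 and 60 second deadlines tried in the question.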

2 Answers

0
votes

UrlFetch is meant for fetching a finite resource from a URL, and generally doesn't play nicely with streams: it waits for the response to complete. I believe that the endpoint doesn't play well with Range requests in general. Look at the headers when my browser hits that stream (great stream, by the way):

GET http://streaming.radionomy.com/jamaican-roots-radio HTTP/1.1
Host: streaming.radionomy.com
Proxy-Connection: keep-alive
Accept-Encoding: identity;q=1, *;q=0
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
Accept: */*
Referer: http://streaming.radionomy.com/jamaican-roots-radio
Accept-Language: en-US,en;q=0.8
Cookie: gsScrollPos=
Range: bytes=0-

And now take a look at the response:

HTTP/1.1 200 OK
Accept-Ranges: none
icy-br: 128
ice-audio-info: bitrate=128;samplerate=44100;channels=2
icy-br: 128
icy-description: Radio Online producida en Colombia.  Al aire: Ska Reggae Rocksteady jamaiquino las 24 horas los 7 días a la semana. http://www.jamaicanroots.com.co/
icy-genre: Jamaican
icy-name: JamaicanRootsRadio
icy-pub: 1
icy-url: http://www.jamaicanroots.com.co
Server: Icecast 2.3.3-kh8
Cache-Control: no-cache, no-store
Pragma: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Origin, Accept, X-Requested-With, Content-Type
Access-Control-Allow-Methods: GET, OPTIONS, HEAD
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Transfer-Encoding: chunked
Content-Type: audio/mpeg
Date: Wed, 17 Jun 2015 19:35:42 GMT
Via: **[my proxy here]**
Connection: keep-alive
Proxy-Connection: keep-alive

In fact, as I hinted above, I think the stream itself is not playing nicely with HTTP. If you run an equivalent request via curl and specify Range: bytes=0-100, you'll notice that the Range request header isn't respected, and it will stream forever.

So, it seems you'll need to use a Managed VM or Compute Engine instance to manually open and close the connection.
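A minimal sketch of what "manually open and close the connection" could look like, assuming an environment where raw outbound sockets are available (such as the Managed VM or Compute Engine instance mentioned above). The function names, the User-Agent string and the 16 KB cap are assumptions made for the illustration; the sketch handles plain http:// URLs only and keeps just the last value of any repeated header.

```python
import socket

try:
    from urlparse import urlsplit        # Python 2
except ImportError:
    from urllib.parse import urlsplit    # Python 3


def parse_head(raw):
    """Split a raw HTTP/ICY response head into (status_code, headers)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    status_code = int(lines[0].split(" ", 2)[1])   # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status_code, headers


def fetch_head_only(url, timeout=5, max_head=16384):
    """Send a GET, read up to the blank line ending the headers, then hang up."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = ("GET %s HTTP/1.0\r\nHost: %s\r\n"
               "User-Agent: stream-check\r\n\r\n" % (path, host))
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(request.encode("ascii"))
        raw = b""
        while b"\r\n\r\n" not in raw and len(raw) < max_head:
            chunk = sock.recv(1024)
            if not chunk:
                break
            raw += chunk
    finally:
        sock.close()   # stop here instead of waiting for the stream to end
    return parse_head(raw)
```

With this approach, a 3xx status together with a Location header would indicate a redirect, and a 200 status with Content-Type: audio/mpeg would indicate that the stream is being served, without reading any of the audio data beyond what arrives with the headers.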

0
votes

If that URL is a streaming endpoint over HTTP, it is probably done using HTTP range requests. This means that if you want to grab just a certain byte range of the stream (say the first few bytes), you need to tell urlfetch to do that. You do this by specifying request headers for urlfetch and specifying the byte range (for example, headers={'Range': 'bytes=0-299'}).
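As a rough sketch, the suggestion could be tried as shown here. The helper names are hypothetical, and the urlfetch import only works inside the App Engine runtime. Note that the response captured in the other answer carries Accept-Ranges: none, so for this particular server the request would presumably still run until the deadline; the status check is there to tell whether a server honored the range at all.

```python
def range_was_honored(status_code, headers):
    """True if the server answered a Range request with partial content."""
    names = set(name.lower() for name in headers)
    return status_code == 206 and "content-range" in names


def fetch_first_bytes(url, last_byte=299, deadline=5):
    """Ask for bytes 0..last_byte via urlfetch (App Engine runtime only)."""
    from google.appengine.api import urlfetch   # only importable on App Engine
    result = urlfetch.fetch(
        url,
        headers={"Range": "bytes=0-%d" % last_byte},
        deadline=deadline,
    )
    return range_was_honored(result.status_code, result.headers), result.content
```

A server that supports range requests replies with 206 Partial Content and a Content-Range header; a plain 200 means the Range header was ignored and the full (here, endless) body is being sent.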