10
votes

I am trying to fetch and decode the page www.dealstan.com with cURL, using the code below:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING , "gzip");     
curl_setopt($ch, CURLOPT_TIMEOUT,5); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch); 

$html = str_get_html("$return");
echo $html;

But it outputs junk characters:

"��}{w�6����9�X�n���.........." for about 100 lines.

I inspected the response in hurl.it and noticed one interesting point: it looks like the HTML is encoded twice (just a guess, based on the response).

Find the response below:

GET http://www.dealstan.com/

200 OK 18.87 kB 490 ms

HEADERS

Cache-Control: max-age=0, no-cache

Cf-Ray: 18be7f54f8d80f1b-IAD

Connection: keep-alive

Content-Encoding: gzip, gzip   <== I suspect this is the cause; does anyone know about it?

Content-Type: text/html; charset=UTF-8

Date: Wed, 19 Nov 2014 18:33:39 GMT

Server: cloudflare-nginx

Set-Cookie: __cfduid=d1cff1e3134c5f32d2bddc10207bae0681416422019; expires=Thu, 19-Nov-15 18:33:39 GMT; path=/; domain=.dealstan.com; HttpOnly

Transfer-Encoding: chunked

Vary: Accept-Encoding

X-Page-Speed: 1.8.31.2-3973

X-Pingback: http://www.dealstan.com/xmlrpc.php

X-Powered-By: HHVM/3.2.0

BODY

H4sIAAAAAAAAA5V8Q5AoWrBk27Ztu/u2bdu2bdu2bdu2bds2583f/pjFVOQqozZnUxkVJ7PwoyAA/qeAb3y83LbYHs/3Hv79wKm/2N5cZyJVtCWu1xyteyzLNqYuWbdtHeELCyIZRRp/1Fe7es3+wL3Vfb

Does anyone know how to decode a response with the header "Content-Encoding: gzip, gzip"?

The site loads properly in Firefox, Chrome, etc., but I am not able to decode it using cURL.

Please help me resolve this issue.

1
Searching Google, I found a bug reported to Mozilla for a similar issue, bugzilla.mozilla.org/show_bug.cgi?id=205156, but I could not find any patch for it. Since the site loads properly in Firefox, they must have solved this issue. - stackguy
Odd. The junk is exactly what's coming back—it shows that way in Safari, too. So it's basically sending back the page gzipped, even though it claims that the Content-Type is text/html. (Is it meant to look like that? Looks to me like their website is just broken. It shows, as I'd expect, the textual representation of the GZIP data if I browse there in Safari...) NB: It seems to be gzipping it in transit, and also sending a gzipped version of the page, so I needed to gunzip it twice to see the actual HTML. - Matt Gibson
Just checked a couple of other browsers—Firefox and Chrome successfully show me the webpage; Opera and Safari show me raw gzip data. So, I'd say that the website is misconfigured and is gzipping the page twice, but that some web browsers are detecting this brokenness and decoding it twice for you. I'm not sure I'd rely on it always being broken like that, as sooner or later they're going to realise that their website is broken in some major browsers, and fix the configuration... - Matt Gibson
As you said, they have fixed the issue, and I am now able to parse the page without any problems. Still, knowing how Firefox handles this case would help with similar issues in the future. - stackguy

1 Answer

8
votes

You can decode it by trimming off the gzip header (the first 10 bytes of what cURL returns) and passing the rest to gzinflate:

$url = "http://www.dealstan.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING, "gzip");     
curl_setopt($ch, CURLOPT_TIMEOUT, 5); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch); 

// Skip the 10-byte gzip header, then inflate the remaining raw deflate data
$return = gzinflate(substr($return, 10));
print_r($return);
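
As a possible alternative to hard-coding the 10-byte offset, here is a minimal sketch that keeps decoding as long as the data still starts with the gzip magic bytes (0x1f 0x8b). The helper name `gunzip_all` and the `$maxLayers` limit are made up for illustration and are not part of the answer above. It assumes PHP 7+ with the zlib extension, which provides `gzdecode`. The demonstration runs offline on a string that is compressed twice, rather than on the live site.

```php
<?php
// Hypothetical helper (not from the original answer): removes every gzip
// layer from a string by checking for the gzip magic bytes 0x1f 0x8b.
function gunzip_all(string $data, int $maxLayers = 5): string
{
    for ($i = 0; $i < $maxLayers; $i++) {
        // Stop as soon as the data no longer looks like a gzip stream.
        if (strncmp($data, "\x1f\x8b", 2) !== 0) {
            break;
        }
        // gzdecode() returns false (and emits a warning) on invalid input.
        $decoded = @gzdecode($data);
        if ($decoded === false) {
            break; // Looked like gzip but could not be decoded; keep what we have.
        }
        $data = $decoded;
    }
    return $data;
}

// Offline demonstration: gzip a string twice, then recover it.
$double = gzencode(gzencode('<html>hello</html>'));
echo gunzip_all($double), "\n"; // prints <html>hello</html>
```

In the script above, calling `gunzip_all($return)` in place of the `gzinflate(substr($return, 10))` line should give the same HTML whether cURL has already removed zero, one, or both layers. It would also keep working once the site fixes its configuration, since plain HTML passes through unchanged.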