I'm trying to save info from the http://www.woorank.com search results. The site caches data for popular sites, but for most you need to do a search before it returns a report. So I tried this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.woorank.com/en/report/generate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('url'=>'hellothere.com'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
Based on the curl output, it seems to redirect to http://www.woorank.com/en/www/hellothere.com, as it should after a search. However, it doesn't generate a report and simply states that there is no report yet (just as it would if you visited the URL directly).
Am I doing something wrong? Or is it not possible to retrieve this info?
Update
Request headers: http://pastebin.com/3ijZfMmF
(Request-Line) POST /en/report/generate HTTP/1.1
Host www.woorank.com
User-Agent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
Referer http://www.woorank.com/
Cookie __utma=201458455.1161920622.1291713267.1291747441.1291773488.4; __utmc=201458455; __utmz=201458455.1291713267.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=201458455.1.10.1291773488
Content-Type application/x-www-form-urlencoded
Content-Length 16
I'm not sure how to get the request headers from the test script, but using this:
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
$headers = curl_getinfo($ch);
The $headers var contains:
Array
(
[url] => http://www.woorank.com/en/www/someothersite.com
[content_type] => text/html; charset=UTF-8
[http_code] => 200
[header_size] => 841
[request_size] => 280
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 1
[total_time] => 0.904581
[namelookup_time] => 3.2E-5
[connect_time] => 3.3E-5
[pretransfer_time] => 3.7E-5
[size_upload] => 155
[size_download] => 5297
[speed_download] => 5855
[speed_upload] => 171
[download_content_length] => 5297
[upload_content_length] => 0
[starttransfer_time] => 0.242975
[redirect_time] => 0.577306
[request_header] => GET /en/www/someothersite.com HTTP/1.1
Host: www.woorank.com
Accept: */*
)
It seems to me that this is the redirect that happens after the search form is submitted. I'm not sure whether no POST is sent at all or whether it just isn't visible in these headers. Since the report isn't generated, I'm guessing it's the former.
The output from curl_exec is simply the HTML from http://www.woorank.com/en/www/someothersite.com.
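One possible reason only the GET shows up: with CURLOPT_FOLLOWLOCATION enabled, the request_header entry appears to hold the last request sent, which here is the GET to the redirect target. Below is a minimal sketch, assuming that is what happens, that turns redirects off so the recorded request should be the POST itself. The function names woorank_debug_options and woorank_debug_request are hypothetical, and I have not verified this against the site.

```php
<?php
// Hypothetical helper: options for a single POST with redirects disabled,
// so the recorded request is the POST rather than the follow-up GET.
function woorank_debug_options(string $site): array
{
    return [
        CURLOPT_URL            => 'http://www.woorank.com/en/report/generate',
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => ['url' => $site],
        CURLOPT_FOLLOWLOCATION => false, // stop at the 302 instead of following it
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
        CURLOPT_HEADER         => true,  // include response headers in the returned string
        CURLINFO_HEADER_OUT    => true,  // record the outgoing request headers
    ];
}

// Hypothetical helper: performs the request and returns both sides of it.
function woorank_debug_request(string $site): array
{
    $ch = curl_init();
    curl_setopt_array($ch, woorank_debug_options($site));
    $response = curl_exec($ch);
    // curl_getinfo must be called after curl_exec to see the sent headers.
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);

    return ['sent' => $sent, 'response' => $response];
}
```

Calling woorank_debug_request('hellothere.com') and printing the 'sent' entry should show whether a POST line and a Content-Type header actually go out.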
Update 2
I tried adding some of the headers to the curl request using:
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
and e.g.
$headers = array(
"Host: www.woorank.com",
"Referer: http://www.woorank.com/"
);
This doesn't make it POST the form, but curl_exec now shows the response headers. Here's the difference:
Firefox, response headers from site:
HTTP/1.1 302 Found
Date Wed, 08 Dec 2010 02:19:18 GMT
Server Apache/2.2.9 (Fedora)
X-Powered-By PHP/5.2.6
Set-Cookie language=en; expires=Wed, 08-Dec-2010 03:19:18 GMT; path=/
Set-Cookie generate=somesite.com; expires=Wed, 08-Dec-2010 03:19:19 GMT; path=/
Location /en/www/somesite.com
Cache-Control max-age=1
Expires Wed, 08 Dec 2010 02:19:19 GMT
Vary Accept-Encoding,User-Agent
Content-Encoding gzip
Content-Length 20
Keep-Alive timeout=1, max=100
Connection Keep-Alive
Content-Type text/html; charset=UTF-8
and from test.php:
HTTP/1.1 302 Found
Date: Wed, 08 Dec 2010 02:27:21 GMT
Server: Apache/2.2.9 (Fedora)
X-Powered-By: PHP/5.2.6
Set-Cookie: language=en; expires=Wed, 08-Dec-2010 03:27:21 GMT; path=/
Set-Cookie: generate=someothersite.com; expires=Wed, 08-Dec-2010 03:27:22 GMT; path=/
Location: /en/www/someothersite.com
Cache-Control: max-age=1
Expires: Wed, 08 Dec 2010 02:27:22 GMT
Vary: Accept-Encoding,User-Agent
Content-Length: 0
Keep-Alive: timeout=1, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
The only differences I notice are that the test response has no Content-Encoding gzip and its Content-Length is 0 instead of 20. I don't know what that means, but when I add "Content-Length: 20" to the request headers, the server responds with "HTTP/1.1 413 Request Entity Too Large" and doesn't do anything; adding "Content-Encoding: gzip" makes it return the HTML gzipped (I assume, since it looks like this: "‹ÍXésÚ8ÿœüZíì&]ìºG “æè1 MmÚ...").
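Two other differences between the browser and the test script stand out. Firefox sends the form as application/x-www-form-urlencoded, whereas passing an array to CURLOPT_POSTFIELDS makes cURL send multipart/form-data. The 302 response also sets a generate cookie, which the script does not send back on the redirect. Below is a minimal sketch of what could be tried, assuming either of these matters to the site (I have not confirmed that). The function name woorank_generate_options is hypothetical.

```php
<?php
// Hypothetical helper: options that mimic the browser request more closely.
function woorank_generate_options(string $site): array
{
    return [
        CURLOPT_URL            => 'http://www.woorank.com/en/report/generate',
        CURLOPT_POST           => true,
        // A string body is sent as application/x-www-form-urlencoded;
        // an array would be sent as multipart/form-data instead.
        CURLOPT_POSTFIELDS     => http_build_query(['url' => $site]),
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        // An empty string enables cURL's in-memory cookie handling, so cookies
        // set by the 302 response are sent with the redirected request.
        CURLOPT_COOKIEFILE     => '',
        CURLOPT_REFERER        => 'http://www.woorank.com/',
    ];
}
```

The options can be applied with curl_setopt_array($ch, woorank_generate_options('hellothere.com')) before calling curl_exec. For 'somesite.com' the encoded body is url=somesite.com, which is 16 bytes and matches the Content-Length in the Firefox request.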
Hope this info helps.