UPDATE: It seems that the User-Agent isn't the only header some hosts require before they will serve HTML. I also had to add the Accept header. In the end, this solved the problem for me with many hosts:

    $response = $client->request('GET', 'http://acme.com', [
        'headers' => [
            'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
            'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        ],
    ]);

I'm trying to use Guzzle to retrieve some websites, but I'm receiving a 403 Forbidden error even though the sites work fine in a browser. I suspect this is down to the host rejecting non-standard User-Agents. To get around this, I am trying to set the User-Agent in Guzzle to mimic a browser, but I can't find any method that actually works. I can browse to the website, and I can also download the HTML with wget and curl -L with no problems, so the issue seems to be with Guzzle.

I've tried:

    $client = new Client(['allow_redirects' => ['track_redirects' => true]]);
    $client->setUserAgent("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36");
    $response = $client->get($domain_name);

^ Weirdly, this one results in an error that seems to say Guzzle is trying to browse to the User-Agent value: cURL error 6: Could not resolve host: Mozilla (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for Mozilla/5.0%20(Windows%20NT%206.2;%20WOW64)%20AppleWebKit/537.36%20(KHTML,%20like

    $domain_name = 'http://www.' . $domain_name;
    $client = new Client(['headers' => ['User-Agent' => 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36']]);
    $response = $client->get($domain_name);

^ Results in a "Client error: GET http://www.xxx.co.uk resulted in a `403 Forbidden`" error

    $domain_name = 'http://www.' . $domain_name;
    $client = new Client(['allow_redirects' => ['track_redirects' => true]]);
    $client->setServerParameter('user-agent', "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
    $response = $client->get($domain_name);

^ Results in an "Argument 3 passed to GuzzleHttp\Client::request() must be of the type array, string given" error

    $domain_name = 'http://www.' . $domain_name;
    $client = new Client(['allow_redirects' => ['track_redirects' => true]]);
    $client->setHeader("user-agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
    $response = $client->get($domain_name);

^ Also results in an "Argument 3 passed to GuzzleHttp\Client::request() must be of the type array, string given" error

Any suggestions? I've gone down a rabbit hole here I think!

I'm wondering if something else is going on here because, as I understand it, Guzzle is just a wrapper for cURL, and cURL can fetch the same web page from the same IP with no problem.

Use on_stats in the Guzzle request options; when you call $stats->getHandlerStats() you will be able to see which user agent is sent. First, check whether the proper User-Agent header is actually sent when you apply your first method. - bhucho
It would still give an error. I suggested stats to help with debugging: at the time the request is sent, getHandlerStats() provides all the transfer details in an array. Also, Guzzle can use cURL, but it is not a cURL wrapper. Guzzle is an HTTP client, not a cURL client, and it can work even without cURL. - bhucho
Use ['debug' => true]; it will also show the user agent when sending requests. - bhucho
Have a look at the upgrade notes at github.com/guzzle/guzzle/blob/master/UPGRADING.md#client. According to them, setUserAgent() has been removed. - bhucho
Your second method, $client = new Client(['headers' => ['User-Agent' => 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36']]);, should send a Mozilla user agent. If it still gives a 403, the error does not seem to be caused by the user agent. - bhucho
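
The on_stats suggestion from the comments could look roughly like the sketch below. It assumes Guzzle 6 or 7 installed through Composer. The MockHandler with a canned 403 response and the example.com URL are my own additions so that the script runs without touching a real host; remove the 'handler' option to inspect a real request.

```php
<?php
// Sketch only: assumes Guzzle 6 or 7 is installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use GuzzleHttp\TransferStats;

// Assumption for illustration: a canned 403 so nothing goes over the network.
$mock = new MockHandler([new Response(403)]);

$client = new Client([
    'handler'     => HandlerStack::create($mock),
    'http_errors' => false, // return the 403 response instead of throwing
    'headers'     => ['User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'],
]);

$sentUserAgent = null;
$statusCode    = null;

$client->request('GET', 'http://example.com', [
    'on_stats' => function (TransferStats $stats) use (&$sentUserAgent, &$statusCode) {
        // The request exactly as it was handed to the handler.
        $sentUserAgent = $stats->getRequest()->getHeaderLine('User-Agent');
        if ($stats->hasResponse()) {
            $statusCode = $stats->getResponse()->getStatusCode();
        }
        // With the cURL handler, $stats->getHandlerStats() holds the
        // low-level transfer details; with the mock it is empty.
    },
]);

echo "User-Agent sent: {$sentUserAgent}\n";
echo "Status: {$statusCode}\n";
```

If the User-Agent printed here is the browser string and the real host still answers 403, the block is probably triggered by something other than that header.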

1 Answer

UPDATE: It seems that the User-Agent isn't the only header some hosts require before they will serve HTML. I also had to add the Accept header. In the end, this solved the problem for me with many hosts:

    $response = $client->request('GET', 'http://acme.com', [
        'headers' => [
            'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
            'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        ],
    ]);
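
As for the odd errors in the question: as far as I can tell, in Guzzle 6 and later the client forwards unknown method names through a magic __call() to request(), using the first argument as the URI and the second as the options array. That would explain why setUserAgent() tried to resolve "Mozilla" as a host and why the two-argument calls complained that argument 3 must be an array. The class below is a hypothetical stand-in written only to illustrate that mechanism (MagicCallClient is not Guzzle's source), so treat it as a sketch:

```php
<?php
// Hypothetical stand-in that mimics the forwarding behaviour; NOT Guzzle's code.
final class MagicCallClient
{
    // Mirrors the shape of request($method, $uri, array $options).
    public function request(string $method, $uri = '', array $options = []): array
    {
        return ['method' => $method, 'uri' => (string) $uri, 'options' => $options];
    }

    // Any unknown method name is treated as an HTTP verb.
    public function __call($method, $args)
    {
        $uri     = $args[0] ?? '';
        $options = $args[1] ?? [];

        return $this->request($method, $uri, $options);
    }
}

$client = new MagicCallClient();

// One argument: the User-Agent string ends up in the URI slot.
$call = $client->setUserAgent('Mozilla/5.0 (Windows NT 6.2; WOW64)');
echo $call['method'], ' -> ', $call['uri'], "\n";

// Two arguments: the second lands in the array-typed options slot.
$typeError = null;
try {
    $client->setHeader('user-agent', 'Mozilla/5.0');
} catch (\TypeError $e) {
    $typeError = $e->getMessage();
}
echo $typeError, "\n";
```

In other words, those setter calls never set a header at all; passing the headers option, either to the Client constructor or per request, is the approach that worked for me.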