I am new to web scraping and have stumbled upon an unexpected challenge. The goal is to input an incomplete URL string for a website and "catch" the corrected URL output returned by the website's redirect function. The specific website I am referring to is Marine Traffic.
When searching for a specific vessel profile, a proper query string should contain the parameters `shipid`, `mmsi`, and `imo`. For example, this link will return a webpage with the profile for a specific vessel:
As it turns out, a query string with only the `imo` parameter will redirect to the exact same URL. So, for example, the following query will redirect to the same one as above:
https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
My question is: using cURL in bash, or another tool such as the Python `requests` library, how could one catch the redirect URL in an automated way? Curling the first URL returns the full HTML, while curling the second URL throws an Access Denied error. Why is this allowed in the browser? What is the workaround for this, if any, and what are some best practices for catching redirect URLs (using either Python or bash)?
curl https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
#returns Access Denied
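For context, this is the kind of approach I assumed would work for catching the redirect with curl alone (`%{redirect_url}` and `%{url_effective}` are standard curl write-out variables). I presume it fails here for the same reason as the plain request above, since the request itself is denied:

# print the Location the server redirects to, without following the redirect
curl -s -o /dev/null -w '%{redirect_url}\n' \
  'https://www.marinetraffic.com/en/ais/details/ships/imo:9337987'

# or follow the redirect and print the final (effective) URL
curl -s -L -o /dev/null -w '%{url_effective}\n' \
  'https://www.marinetraffic.com/en/ais/details/ships/imo:9337987'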
Note: Adding a user agent with `curl --user-agent 'Chrome/79'` does not solve the problem. The Access Denied error goes away, but the response body is empty.
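For reference, this is roughly how I have been checking that attempt; `-i` includes the response headers in the output, which is where I would expect a Location header to appear if the server were actually issuing the redirect:

curl -i --user-agent 'Chrome/79' \
  'https://www.marinetraffic.com/en/ais/details/ships/imo:9337987'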