
I am new to web scraping and have stumbled upon an unexpected challenge. The goal is to input an incomplete URL string for a website and "catch" the corrected URL returned by the website's redirect. The specific website I am referring to is Marine Traffic.

When searching for a specific vessel profile, a proper query string should contain the parameters shipid, mmsi and imo. For example, this link returns the profile page for a specific vessel:

https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA/_:97e0de64144a0d7abfc154ea3bd1010e

As it turns out, a query string with only the imo parameter redirects to the exact same URL. For example, the following link redirects to the same one as above:

https://www.marinetraffic.com/en/ais/details/ships/imo:9337987

My question is: using cURL in bash, or a tool such as the Python requests library, how can one catch the redirect URL in an automated way? Curling the first URL returns the full HTML, while curling the second URL returns an Access Denied error. Why is the redirect allowed in the browser but not here? What is the workaround, if any, and what are some best practices for catching redirect URLs (in either Python or bash)?

curl https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
#returns Access Denied

Note: Adding a user agent with curl --user-agent 'Chrome/79' does not get around the issue; the Access Denied error goes away, but nothing is returned.
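For reference, the standard curl approach I would expect to work here is -L (follow redirects) combined with --write-out '%{url_effective}', which prints the final URL after all redirects. A sketch of that attempt, where the user agent string is just an example:

# -L follows redirects, -o /dev/null discards the body, -w prints only the final URL
curl -sSL -o /dev/null \
     --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0' \
     -w '%{url_effective}\n' \
     https://www.marinetraffic.com/en/ais/details/ships/imo:9337987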


1 Answer


You can try .url on the response object; requests follows redirects by default, so it holds the final URL after the redirect:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9337987"

r = requests.get(url, headers=headers)
print(r.url)

Prints:

https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA
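If you want to catch the redirect target itself without downloading the full profile page, here is a minimal sketch (assuming the site answers the short URL with an ordinary 3xx redirect rather than a client-side one): disable automatic redirect following and read the Location header, or inspect the redirect chain in r.history after following:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9337987"

# Stop at the first response instead of following the redirect automatically.
r = requests.get(url, headers=headers, allow_redirects=False)
if r.is_redirect:
    print(r.headers["Location"])   # the corrected URL, without fetching the page body

# Alternatively, follow redirects and inspect the chain afterwards.
r = requests.get(url, headers=headers)
for hop in r.history:              # one entry per redirect response
    print(hop.status_code, hop.url)
print(r.url)                       # final URL after all redirects

Using allow_redirects=False keeps the request cheap when you only need the corrected URL; r.history is useful when you also want the final page.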