I found a JSON file that contains all the data for the page. In the browser I noticed that, after the webpage is downloaded, the JS sends a POST request to load the JSON file. If I put the JSON file link directly into the browser, it is a GET request. Is it better practice to get the webpage first and use the relevant cookies to request the JSON file, instead of requesting the JSON link directly? Would that look more like a normal user?

I also see that an XSRF-TOKEN was sent in both the POST and GET requests. I suppose this is needed in the request?

Here are the POST and GET requests (with the analytics and third-party tool cookies stripped out). Which header items are best included in the requests? I plan to rotate the user-agent and IP.

POST
Host:
Connection: keep-alive
Content-Length: 2
Accept: application/json, text/plain, */*
X-XSRF-TOKEN:
ADRUM: isAjax:true
User-Agent:
Content-Type: application/json;charset=UTF-8
Origin:
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Referer:
Accept-Encoding: gzip, deflate, br
Accept-Language: en,en-US
Cookie: JSESSIONID=; XSRF-TOKEN=;

GET
Host:
Connection:
Upgrade-Insecure-Requests: 1
User-Agent:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en,en-US
Cookie: JSESSIONID=; XSRF-TOKEN=;

Depends on the type of scraping you're doing. If the goal is to avoid getting banned, then mimicking the HTTP request that the page itself makes is probably best. It would help if you provided a URL so that the second question can be answered. - AaronS

1 Answer

After playing around with the HTTP request that the JavaScript makes to grab the data you want, this is the code I came up with. I essentially just removed dictionary keys and values until I got down to the minimum needed.

In Chrome DevTools, you can see the XHR by inspecting the page and opening the Network tab. The page makes three requests, two of them for the same JSON file you've found. You can copy this request as a cURL command. Pasting it into curl.trillworks.com gives you the headers, params and cookies needed to make that request. I usually play about until I get down to the minimum I need to make the request to the server. In this case, you do need the XSRF token and the JSESSIONID cookie.
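The "remove keys until it breaks" step can also be done in a loop rather than by hand. This is only a rough sketch of that idea: `minimize_headers` and the `still_works` callback are names made up for illustration, and `still_works` is assumed to send the request with the given headers and report whether the JSON still comes back.

```python
from typing import Callable, Dict


def minimize_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Drop headers one at a time, keeping a removal only if the request still succeeds."""
    kept = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the server did not need it
    return kept


if __name__ == "__main__":
    # Stand-in for a real request: pretend the server only insists on these two headers.
    required = {"X-XSRF-TOKEN", "Content-Type"}
    copied = {
        "Accept": "application/json, text/plain, */*",
        "X-XSRF-TOKEN": "token-from-browser",
        "ADRUM": "isAjax:true",
        "Content-Type": "application/json;charset=UTF-8",
    }
    print(minimize_headers(copied, lambda h: required <= set(h)))
```

Against a real site, `still_works` would wrap the actual `requests` call and check the status code and body. Each trial is a live request, so it is worth adding a delay between them.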

So to make the request, you'll need a fresh X-XSRF-TOKEN and JSESSIONID. The first example hard-codes them, just to illustrate exactly what is needed. The second example shows how to use a request to grab the cookies and then use them to get the JSON object without hard-coding.

Code example: hard-coded cookies

import requests

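# Session cookie copied from the browser; it expires, so a fresh value is needed.
cookies = {'JSESSIONID': '6502ACBC7179AAE392EA41A187396ADA.app18'}

# Minimum headers that were needed for this request.
headers = {
    'X-XSRF-TOKEN': '19672e30-69ca-420e-b649-a7e45031cda6',
    'Content-Type': 'application/json;charset=UTF-8',
}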

response = requests.post('https://www.tedbaker.com/uk/json/c/category_womens_clothing', headers=headers, cookies=cookies)

response.json()

Output

{'success': True,
 'data': {'results': [{'code': '245655-BLACK',
    'name': 'Ditsy floral midi dress',
    'url': '/uk/Womens/Clothing/Dresses/PHILINA-Ditsy-floral-midi-dress-Black/p/245655-BLACK',
    'stock': {'stockLevelStatus': {'code': 'inStock',
      'type': 'StockLevelStatus'},
     'shippedFromStore': False},
    'price': {'currencyIso': 'GBP',
     'value': 99.0,
     'priceType': 'BUY',
     'formattedValue': '£99.00'}, ....
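Once you have the JSON, pulling out the fields you care about is plain dictionary access. Here is a small sketch based only on the keys visible in the output above; `summarize_products` is a made-up helper name, and the real payload may well contain more fields or omit some of these, hence the `.get()` calls.

```python
from typing import Any, Dict, List


def summarize_products(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pull code, name, price and stock status out of each entry in data.results."""
    rows = []
    for item in payload.get("data", {}).get("results", []):
        price = item.get("price", {})
        stock = item.get("stock", {}).get("stockLevelStatus", {})
        rows.append({
            "code": item.get("code"),
            "name": item.get("name"),
            "price": price.get("value"),
            "currency": price.get("currencyIso"),
            "in_stock": stock.get("code") == "inStock",
        })
    return rows


# Sample trimmed from the output shown above.
sample = {
    "success": True,
    "data": {"results": [{
        "code": "245655-BLACK",
        "name": "Ditsy floral midi dress",
        "stock": {"stockLevelStatus": {"code": "inStock", "type": "StockLevelStatus"},
                  "shippedFromStore": False},
        "price": {"currencyIso": "GBP", "value": 99.0, "priceType": "BUY"},
    }]},
}

if __name__ == "__main__":
    for row in summarize_products(sample):
        print(row)
```

In the real script you would pass `response.json()` instead of `sample`.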

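Code example: request with cookies

import requests

# The session keeps whatever cookies the server sets across requests.
s = requests.Session()
cookies = s.get('https://www.tedbaker.com/uk/Womens/Clothing/c/category_womens_clothing').cookies

headers = {
    'Content-Type': 'application/json;charset=UTF-8'
}
# Passing cookies explicitly mirrors the first example; the session would also send them on its own.
response = s.get('https://www.tedbaker.com/uk/json/c/category_womens_clothing', headers=headers, cookies=cookies)
response.json()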

Explanation

We create a requests session so that cookies persist across whatever HTTP requests we make. In this case, we make an HTTP GET request to the actual page address and grab the cookies from the response. We then use those cookies, along with the minimum headers required, to get the JSON object.

There is no need to hard-code the cookies. However, for some reason a POST request did not work here; a GET request was sufficient.
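If you do want to mimic the browser's POST, note that the captured POST carries the token twice: once as the XSRF-TOKEN cookie and once as the X-XSRF-TOKEN header. The sketch below copies the cookie value into the header before posting. I have not confirmed that this makes POST work on this site, so treat it as a guess. The function names are made up, and the `"{}"` body is an assumption based on the `Content-Length: 2` in the captured request.

```python
from typing import Any, Dict, Mapping


def xsrf_headers(cookies: Mapping[str, str]) -> Dict[str, str]:
    """Copy the XSRF-TOKEN cookie, if present, into the X-XSRF-TOKEN header."""
    headers = {"Content-Type": "application/json;charset=UTF-8"}
    token = cookies.get("XSRF-TOKEN")
    if token:
        headers["X-XSRF-TOKEN"] = token
    return headers


def post_category_json(session: Any, page_url: str, json_url: str) -> Any:
    """Visit the page to collect cookies, then POST to the JSON endpoint with the token header.

    `session` is expected to behave like requests.Session (get, post, cookies).
    """
    session.get(page_url)  # lets the server set its cookies on the session
    headers = xsrf_headers(session.cookies)
    # Assumption: the 2-byte body of the captured POST is an empty JSON object.
    response = session.post(json_url, headers=headers, data="{}")
    response.raise_for_status()
    return response.json()
```

With requests this would be called as `post_category_json(requests.Session(), page_url, json_url)`, using the page URL and JSON URL from the session example.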