I have several crawlers that crawl multiple sites and store the contents in a database. The logs from the program are stored in CloudWatch Logs.
When a crawler successfully pulls back content, the log entries look similar to the ones below:
HTTP GET: 200 - https://www.thecheyennepost.com/news/national/r
HTTP GET: 200 - https://www.thecheyennepost.com/news/f-e-warren-hous
The issue I'm dealing with is identifying when 4xx errors pop up. Below is an example:
HTTP GET: 429 - https://www.livingstonparishnews.com/search/?l=25&sort=
HTTP GET: 429 - https://www.livingstonparishnews.com/search/?l=25&sort=rele
HTTP GET: 429 - https://www.ktbs.com/search/?l=25&s=start_time&sd=desc&f=
I tried using the filter pattern status_code=4*, but that didn't return anything.
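For context, here's a minimal sketch of the same filter applied through boto3's filter_log_events; the log group name /crawlers/app is just a placeholder, not my real group:

```python
import boto3

# Minimal sketch of the attempt (placeholder log group name).
logs = boto3.client("logs")

response = logs.filter_log_events(
    logGroupName="/crawlers/app",      # placeholder, not the real log group
    filterPattern="status_code=4*",    # the pattern I tried; it returns no events
    limit=50,
)

for event in response.get("events", []):
    print(event["message"])
```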
I just want to be able to filter for any and all 4xx errors.
Any help that can be provided would be greatly appreciated.