
I'm working on a custom RSS feed aggregator which parses RSS feeds from various news-type sites, shows a summary and links back to the original site. Nothing terribly exciting.

I'm trying to get an image for each article as well by using the original page's og:image meta tag.
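For context, extracting the og:image tag itself is the easy part; a minimal sketch using only Python's standard library (assuming reasonably well-formed HTML — a real aggregator might prefer a full parser like BeautifulSoup) could look like this:

```python
from html.parser import HTMLParser

class OgImageParser(HTMLParser):
    """Collects the first og:image meta tag encountered."""
    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.og_image is None:
            d = dict(attrs)
            if d.get("property") == "og:image":
                self.og_image = d.get("content")

def extract_og_image(html):
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image

html = '<html><head><meta property="og:image" content="https://example.com/img.jpg"></head></html>'
print(extract_og_image(html))  # https://example.com/img.jpg
```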

However, I'm finding that many of the URLs in the og:image tag return 400, 403 or 404 errors when the images are accessed programmatically.

Some sites seem to check the User-Agent header, so for testing only I've set my User-Agent string to Safari's. This gets some og:image links working, but it's not an acceptable solution (a crawler masquerading as a browser).

This does not work for the majority of images though, which continue to return 400/403.
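For reference, the User-Agent workaround described above might be sketched like this (the Safari string is just an illustrative example, and `fetch_image` is a hypothetical helper):

```python
import urllib.request

# Example Safari User-Agent string (illustrative only).
SAFARI_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) "
             "Version/16.1 Safari/605.1.15")

def fetch_image(url, user_agent=SAFARI_UA):
    """Fetch an image URL with a spoofed User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```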

Assuming that all the sites I've tested do not have missing image files and they are proactively preventing anyone other than Facebook/Twitter from using those images, is there any other way to reliably and programmatically retrieve images to display in an RSS aggregator?

Feedly et al. all seem to have images for the vast majority of their aggregated content, so I'm not clear why I'm having such difficulty.


1 Answer


You already found one solution, which, as you note, is not ideal: changing your User-Agent string.

You can also tackle the problem another way: instead of scraping the image yourself, save only the URL of the image. In your RSS feed aggregator, use that direct image URL, so that the browser performing the request is the real client, not your (server-side) crawler.
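A rough sketch of that approach (all names here are illustrative): the aggregator stores the og:image URL and simply emits it in its HTML, so the reader's browser makes the image request directly.

```python
from html import escape

def render_item(title, link, image_url):
    """Render one feed item; the <img> src points at the original site,
    so the image is fetched by the reader's browser, not our server."""
    img = f'<img src="{escape(image_url, quote=True)}" alt="">' if image_url else ""
    return (f"<article>{img}"
            f'<h2><a href="{escape(link, quote=True)}">{escape(title)}</a></h2>'
            f"</article>")

print(render_item("Example story", "https://news.example/story",
                  "https://news.example/og.jpg"))
```

Note that sites which also check the Referer header may still block hotlinked images, but for plain User-Agent checks this sidesteps the problem entirely.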

Would that work?