I'm working on a custom RSS feed aggregator which parses RSS feeds from various news-type sites, shows a summary and links back to the original site. Nothing terribly exciting.
I'm trying to get an image for each article as well by using the original page's og:image meta tag.
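To make the lookup concrete, here is a minimal sketch of the og:image extraction, assuming Python and its standard `html.parser` module. The names `OgImageParser` and `extract_og_image` are illustrative only and not taken from any particular library.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Remembers the first <meta property="og:image" content="..."> value."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first og:image is kept.
        if tag != "meta" or self.og_image is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.og_image = attr_map["content"]


def extract_og_image(html: str) -> Optional[str]:
    """Return the og:image URL found in the page HTML, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

Given an article page's HTML as a string, `extract_og_image(html)` returns the image URL, or `None` when the page has no og:image tag.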
However, I'm finding that a lot of the URLs in the og:image tag return 400, 403 or 404 errors when I access the images programmatically.
Some sites seem to check for a browser's User-Agent string in the request headers, so for testing only, I've set my User-Agent header to Safari's. This gets some og:image links working, but it is not an acceptable solution (a crawler masquerading as a browser).
This does not work for the majority of images though, which continue to return 400/403.
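For reference, the image request looks roughly like the sketch below, assuming Python's standard `urllib`. The `CRAWLER_USER_AGENT` value and the function names are hypothetical placeholders. It sends an honest crawler identity rather than a browser string and reports the HTTP status when a site refuses the request.

```python
import urllib.error
import urllib.request
from typing import Optional

# Hypothetical crawler identity, used here only for illustration.
CRAWLER_USER_AGENT = "ExampleFeedAggregator/0.1 (+https://example.com/bot)"


def build_image_request(image_url: str) -> urllib.request.Request:
    """Build a GET request that identifies the aggregator honestly."""
    return urllib.request.Request(
        image_url,
        headers={"User-Agent": CRAWLER_USER_AGENT, "Accept": "image/*"},
    )


def fetch_image(image_url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the image bytes, or None if the server rejects the request."""
    request = build_image_request(image_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # 400/403/404 responses end up here; err.code holds the status.
        print(f"{image_url}: HTTP {err.code}")
        return None
    except urllib.error.URLError as err:
        # DNS failures, refused connections, timeouts and similar.
        print(f"{image_url}: {err.reason}")
        return None
```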
Assuming that all the sites I've tested do not have missing image files and they are proactively preventing anyone other than Facebook/Twitter from using those images, is there any other way to reliably and programmatically retrieve images to display in an RSS aggregator?
Feedly et al. all seem to have images for the vast majority of their aggregated content, so I'm not clear why I'm having such difficulty.