4 votes

I'm serving my JavaScript app (SPA) by uploading it to S3 (the default root object index.html is served with Cache-Control: max-age=0, no-cache and points to fingerprinted JS/CSS assets) and configuring the bucket as the origin of a CloudFront distribution. My domain name, let's say SomeMusicPlatform.com, has a CNAME record in Route 53 pointing to the distribution's domain name. This is working great and everything is cached nicely.

Now I want to serve a prerendered HTML version to bots and social network crawlers. I have set up a server at prerendered.SomeMusicPlatform.com that responds with a pre-rendered version of the JavaScript app (SPA).

What I'm trying to do in the Lambda function is to inspect the User-Agent, identify bots, and serve them the prerendered version from my custom server (and not the JavaScript content from S3 that I normally serve to regular browsers).

I thought I could achieve this with Lambda@Edge, following the AWS example Using an Origin-Request Trigger to Change From an Amazon S3 Origin to a Custom Origin: a function on the origin-request trigger that switches the origin to my custom prerender server when it identifies a crawler bot in the request headers (or, in the testing phase, when a prerendered=true query parameter is present).
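Roughly, the origin-request function I have in mind is something like this (a simplified sketch based on that AWS example; the UA detection is left out for now and the domain is a placeholder for my prerender server):

```javascript
'use strict';

// Origin-request trigger: switch from the S3 origin to the custom
// prerender server when the prerendered=true query parameter is present.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;

    if (request.querystring.includes('prerendered=true')) {
        // Replace the S3 origin with a custom origin.
        request.origin = {
            custom: {
                domainName: 'prerendered.SomeMusicPlatform.com',
                port: 443,
                protocol: 'https',
                path: '',
                sslProtocols: ['TLSv1.1', 'TLSv1.2'],
                readTimeout: 30,
                keepaliveTimeout: 5,
                customHeaders: {}
            }
        };
        // The Host header has to match the new origin.
        request.headers['host'] = [{ key: 'Host', value: 'prerendered.SomeMusicPlatform.com' }];
    }

    callback(null, request);
};
```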

The problem is that the Lambda@Edge function on the origin-request trigger is not firing, because CloudFront still has the default root object index.html cached and returns it from the edge cache. I get X-Cache: RefreshHit from CloudFront for both SomeMusicPlatform.com/?prerendered=true and SomeMusicPlatform.com, even though the default root object index.html is served with Cache-Control: max-age=0, no-cache.

How can I keep the caching and low latency of serving my JavaScript SPA through CloudFront, while serving content from my custom prerender server to crawler bots only?

This is a bit complicated, because you need to whitelist the User-Agent header for forwarding to the origin, which will hurt your cache hit ratio. You might consider a second trigger on the viewer-request side, which would implement the UA detection logic and rewrite the User-Agent from its real value to one of two possible values, e.g. User-Agent: Browser or User-Agent: Bot. Then change the origin behavior based on detecting one of those two values in the origin-request trigger. Two triggers, but optimal caching. Thoughts? - Michael - sqlbot
In the testing phase (before I go into UA detection) I'm trying to simplify things by using a prerendered=true parameter that dictates which origin should be used, either S3 or my custom prerender server. The issue is that I always get the cached website, whether I use the prerendered parameter or not. So I go to mywebsite.com and get the content from S3. Then I visit mywebsite.com/?prerendered=true and get a cache hit from S3 when it should come from the custom origin. At that point, if I create an invalidation, mywebsite.com/?prerendered=true will get the content from my custom origin. - Matic Jurglič
... but then if I visit mywebsite.com (without the parameter), cached content from the custom origin is returned, when it should be the cached content from the S3 origin. How do I make this work so that it switches to the correct origin depending on the parameter (and later, on the UA)? - Matic Jurglič
Hey, can you share your Lambda function? I can't write it. - Mehmet Ali Peker

1 Answer

1 vote

The caching problem (getting the same hit for both mywebsite.com/?prerendered=true and mywebsite.com) was solved by adding prerendered to the query string whitelist in the CloudFront distribution. CloudFront now correctly maintains both the normal and the prerendered version of the website content, depending on the presence of the parameter (without the parameter, cached content from the S3 origin is served; with the parameter, cached content from the custom origin specified in the Lambda function is served).
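If your distribution still uses the legacy forwarded-values settings rather than the newer cache policies, the relevant part of the cache behavior ends up looking roughly like this (shown as the structure the CloudFront API uses; only the query string part is shown, the rest of the behavior stays as it is):

```javascript
// Illustrative fragment of the cache behavior's ForwardedValues.
const forwardedValues = {
    QueryString: true,
    // "prerendered" becomes part of the cache key, so the normal and the
    // prerendered variants are cached as separate objects at the edge.
    QueryStringCacheKeys: { Quantity: 1, Items: ['prerendered'] }
    // Cookies/Headers settings unchanged from the existing behavior.
};
```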

This was enough for the testing phase, to confirm that the mechanism works correctly. I then followed Michael's advice and added another Lambda function on the viewer-request trigger, which adds a custom Is-Bot header when a bot is detected in the User-Agent. Whitelisting was needed again, this time for the custom header (so that separate caches are maintained for the two origins, depending on the header). The Lambda function later in the origin-request trigger then decides which origin to use, based on the Is-Bot header.
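A minimal sketch of the two functions (each one is deployed as its own Lambda@Edge function and exported as handler; the bot regex and domain name here are illustrative, not the exact ones I use in production):

```javascript
'use strict';

// Illustrative bot detection pattern; extend it for the crawlers you care about.
const BOT_PATTERN = /bot|crawler|spider|facebookexternalhit|twitterbot|slurp/i;

// 1) Viewer-request trigger: normalize UA detection into a custom Is-Bot header.
//    Is-Bot must be whitelisted for forwarding so it becomes part of the cache key.
exports.viewerRequestHandler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const userAgent = (request.headers['user-agent'] || [{ value: '' }])[0].value;

    request.headers['is-bot'] = [{
        key: 'Is-Bot',
        value: BOT_PATTERN.test(userAgent) ? 'true' : 'false'
    }];

    callback(null, request);
};

// 2) Origin-request trigger: route bots to the prerender server, everyone else to S3.
exports.originRequestHandler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const isBot = request.headers['is-bot'] &&
                  request.headers['is-bot'][0].value === 'true';

    if (isBot) {
        request.origin = {
            custom: {
                domainName: 'prerendered.SomeMusicPlatform.com',
                port: 443,
                protocol: 'https',
                path: '',
                sslProtocols: ['TLSv1.1', 'TLSv1.2'],
                readTimeout: 30,
                keepaliveTimeout: 5,
                customHeaders: {}
            }
        };
        // Point the Host header at the new origin as well.
        request.headers['host'] = [{ key: 'Host', value: 'prerendered.SomeMusicPlatform.com' }];
    }

    callback(null, request);
};
```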