As it turned out, the problem here was not as it appeared.
TopicArn and TargetArn are actually interchangeable, in this case. (Why are there two possible ways to pass the topic ARN? SNS originally only supported topics as targets, but now supports other kinds of things, so this is likely a case of API backwards-compatibility, which AWS is typically very good at.)
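As a rough illustration of that interchangeability, here is a minimal Python sketch. The helper name publish_to_topic and the example ARN in the comments are hypothetical; the only real API assumed is the boto3 SNS client's publish call, which accepts the ARN as either TopicArn or TargetArn.

```python
def publish_to_topic(sns_client, topic_arn, message, arn_parameter="TopicArn"):
    """Publish a message, passing the ARN under either parameter name.

    `sns_client` is expected to behave like boto3.client("sns").
    This helper is illustrative only and is not part of any AWS SDK.
    """
    if arn_parameter not in ("TopicArn", "TargetArn"):
        raise ValueError("arn_parameter must be 'TopicArn' or 'TargetArn'")
    # Both spellings end up as the same Publish request against the topic.
    return sns_client.publish(**{arn_parameter: topic_arn, "Message": message})


# Usage inside a Lambda function (requires boto3 and valid credentials):
#   import boto3
#   sns = boto3.client("sns")
#   arn = "arn:aws:sns:us-east-1:123456789012:example-topic"  # hypothetical ARN
#   publish_to_topic(sns, arn, "hello")                             # TopicArn
#   publish_to_topic(sns, arn, "hello", arn_parameter="TargetArn")  # TargetArn
```

Either call needs the execution role to be allowed SNS:Publish on the topic, so switching between the two parameter names should not, by itself, change whether the request is authorized.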
tl;dr: Occasionally, when an IAM policy used by a Lambda Execution Role is modified, the Lambda function does not behave as if the policy change has occurred.
The reason it appeared to work only after changing from one to the other was related (on some level) to what happens when you update the code for a Lambda function.
The Lambda service manages the containers that run your code, with each container running not more than one concurrent invocation of the function at a time. Subsequent invocations may reuse the same container... but only when the code associated with the function is identical, unchanged.
Update the function code and then run the function again (for the first time with the new code in place) and you are guaranteed to be in a new container that hasn't been used before.
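One way to observe container reuse for yourself is to keep some state at module level, since module-level code runs once per container rather than once per invocation. The following is a minimal sketch; the names CONTAINER_ID and _invocations are made up for illustration.

```python
import uuid

# Module-level code runs once, when a new container first loads the function.
CONTAINER_ID = str(uuid.uuid4())
_invocations = 0


def handler(event, context):
    """Report which container served this invocation and whether it is new."""
    global _invocations
    _invocations += 1
    return {
        "container_id": CONTAINER_ID,
        "invocation": _invocations,
        "cold_start": _invocations == 1,
    }
```

If the reasoning above holds, invoking the function again right after a code update should return a different container_id with cold_start set to true, while repeated invocations of unchanged code will often (though not always) return the same container_id.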
The new container must fetch the temporary credentials for the Lambda Execution Role using the AssumeRole action in AWS Security Token Service, which provides a temporary AWS Access Key ID and Secret Key, as well as a Session Token. Lambda stores these in environment variables as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, where the AWS SDK in your function (used to make calls to other services, like SNS, in this example) picks them up and uses them to sign requests.
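To check whether a given invocation is running with the same temporary credentials as an earlier one, you could log a fingerprint of those environment variables instead of the secrets themselves. This is only a sketch: the helper name credential_fingerprints is hypothetical, and hashing is simply one way to compare values without printing them.

```python
import hashlib
import os

CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def credential_fingerprints(environ=None):
    """Return a short SHA-256 fingerprint per credential variable.

    A value of None means the variable is unset or empty. The raw
    credentials are never returned, so the result is safe to log.
    """
    environ = os.environ if environ is None else environ
    fingerprints = {}
    for name in CREDENTIAL_VARS:
        value = environ.get(name)
        if value:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            fingerprints[name] = digest[:12]
        else:
            fingerprints[name] = None
    return fingerprints
```

Calling credential_fingerprints() from the handler and comparing the logged output across invocations shows when the container has picked up a new set of temporary credentials.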
What OP ran into was a quirk somewhere in this setup. Something in IAM or STS results in cached or stale data inside AWS systems -- a policy (or the latest version of a policy) that was necessary to allow the Lambda execution role to perform the SNS:Publish action against the topic in question was not yet visible to a component that needed to see it, in order for the action to be allowed.
It is not clear where exactly this caching occurs. Lambda of course caches the temporary credentials, but so does the EC2 metadata service, and so does any well-behaved client, since it wouldn't make sense to continually make requests to STS when the temporary credentials are still valid. I'm not convinced that the caching of the credentials, themselves, is the cause (though it contributes).
STS credentials are a black box, especially that "Session Token." Does it contain encrypted data? Or is it just a large random value, literally nothing more than a symbolic "token" with no intrinsic meaning? It doesn't really matter, but the point is that it isn't clearly documented.
IAM is a massive, distributed system, so it naturally does have "eventual consistency" issues that can arise on occasion.
But, somehow, creating a new Lambda container -- which necessarily must make a new call to STS -- seems to have a cache-busting side effect: in a new deployment, the current IAM policies for the execution role become visible in cases where they previously were not taking effect.
This seems to pop up on occasion after you try a given action, it fails, and you realize you need to edit the IAM policy that you intended to be sufficient to allow the action. So you edit the policy, but subsequent attempts still fail, yet the policy seems valid, so you can't figure out why it isn't working, and you toss your hands in the air, send a message to John Rotenstein that you are at your wits' end... then, after giving up for the night in despair, you return the next morning to troubleshoot further, and discover it's suddenly working.
Presumably this is related to the old execution role temporary credentials being replaced, either due to the containers being pruned and replaced due to inactivity, or simply because temporary credentials have a finite lifespan... but it is unclear whether refreshing the STS tokens is actually a necessary part of the solution, or whether it simply fixes a logjam inside IAM as a side effect.
Redeploying the Lambda function to test a code change would likely have the same effect, either way, as it did here.
TopicARN is "The topic you want to publish to." I can understand that TargetARN should be used when publishing a message to a non-Topic (eg to a mobile app), but it is strange that it worked for sending to an SNS topic, too! I just did some testing and verified that it works for me using both TopicARN and TargetARN. Presumably, I have sufficient permissions for both cases. – John Rotenstein