4
votes

I was trying to publish a message to an SNS topic using boto3 in Lambda like so:

    import boto3

    SNS = boto3.client('sns')  # the client the snippet below relies on

    def publish_msg(msg, bar):
        response = SNS.publish(
            TopicArn='blah_blah_arn',
            Message=msg,
            MessageAttributes={
                'foo': {
                    'DataType': 'String',
                    'StringValue': bar
                }
            }
        )
        return response

which didn't work, because it kept giving me an auth error that went something like this:

Error publishing message because lambda_fn_role doesn't have the permissions to invoke SNS:Publish on resource blah_blah_arn

But I was certain my policy for that function was correct, so I changed TopicArn to TargetArn and it worked!

So my question is this: what is the difference between a topic ARN and a target ARN, and when should one be used over the other?

The AWS docs for boto3 don't answer this question at all.

Much appreciated!

1
That is indeed interesting! I agree that the documentation doesn't distinguish much between the two, aside from saying that TopicArn is "The topic you want to publish to." I can understand that TargetArn should be used when publishing a message to a non-topic (e.g. to a mobile app), but it is strange that it worked for sending to an SNS topic, too! I just did some testing and verified that it works for me using both TopicArn and TargetArn. Presumably, I have sufficient permissions for both cases. – John Rotenstein
I'd suggest you try changing it back, now, and I suspect it will work as expected regardless of which option you provide. I believe the problem was related to caching of the execution role's temporary credentials, not anything in your code. To make the change you made, you had to redeploy the Lambda function, which would've automatically deployed the function into a new container with a fresh set of temporary credentials for the execution role, and this would fix that problem. This seems particularly likely if you edited the role permissions shortly before or anytime after the prior deploy. – Michael - sqlbot
Legend @Michael! That did it, thanks for the explanation behind why that worked too, very helpful! – Red

1 Answer

3
votes

As it turned out, the problem here was not as it appeared.

TopicArn and TargetArn are actually interchangeable, in this case. (Why are there two possible ways to pass the topic ARN? SNS originally only supported topics as targets, but now supports other kinds of things, so this is likely a case of API backwards-compatibility, which AWS is typically very good at.)
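To illustrate the interchangeability (a minimal sketch; `build_publish_kwargs` is my own helper, and `'blah_blah_arn'` is the question's placeholder, not a real ARN), the two calls differ only in the keyword that carries the ARN:

```python
TOPIC_ARN = 'blah_blah_arn'  # placeholder from the question; use a real topic ARN

def build_publish_kwargs(msg, use_target=False):
    # For a topic, SNS.publish accepts the ARN under either keyword;
    # only the key name differs, the value is the same topic ARN.
    key = 'TargetArn' if use_target else 'TopicArn'
    return {key: TOPIC_ARN, 'Message': msg}

# sns = boto3.client('sns')
# sns.publish(**build_publish_kwargs('hello'))                   # via TopicArn
# sns.publish(**build_publish_kwargs('hello', use_target=True))  # via TargetArn
```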

tl;dr: Occasionally, when an IAM policy used by a Lambda Execution Role is modified, the Lambda function does not behave as if the policy change has occurred.

The reason it appeared to work only after changing from one to the other was related (on some level) to what happens when you update the code for a Lambda function.

The Lambda service manages the containers that run your code, with each container handling at most one invocation of the function at a time. Subsequent invocations may reuse the same container... but only when the code associated with the function is identical and unchanged.
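Container reuse is easy to observe, because anything initialized at module scope survives across invocations that land in the same (warm) container. A small sketch (the handler name and counter are illustrative, not anything AWS-specific):

```python
# Module scope runs once per container, not once per invocation.
INVOCATION_COUNT = 0

def lambda_handler(event, context):
    """Increment a module-level counter; a reused container keeps
    the counter's value between invocations, a fresh one starts at 0."""
    global INVOCATION_COUNT
    INVOCATION_COUNT += 1
    return {'invocation_in_this_container': INVOCATION_COUNT}
```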

Update the function code and then run the function again (for the first time with the new code in place) and you are guaranteed to be in a new container that hasn't been used before.

The new container must fetch the temporary credentials for the Lambda execution role using the AssumeRole action in AWS Security Token Service, which provides a temporary access key ID and secret access key, as well as a session token. Lambda stores these in the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, where the AWS SDK in your function (used to make calls to other services, like SNS in this example) picks them up and uses them to sign requests.
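You can see those variables from inside the function. A sketch (the helper name is mine; the values are secrets and should never actually be logged):

```python
import os

def execution_role_credentials():
    """Return the temporary execution-role credentials that Lambda
    injects into the container's environment."""
    return {
        'access_key_id': os.environ.get('AWS_ACCESS_KEY_ID'),
        'secret_access_key': os.environ.get('AWS_SECRET_ACCESS_KEY'),
        'session_token': os.environ.get('AWS_SESSION_TOKEN'),
    }
```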

What OP ran into was a quirk somewhere in this setup. Something in IAM or STS resulted in cached or stale data inside AWS systems -- a policy (or the latest version of a policy) that was necessary to allow the Lambda execution role to perform the SNS:Publish action against the topic in question was not yet visible to a component that needed to see it in order for the action to be allowed.

It is not clear where exactly this caching occurs. Lambda of course caches the temporary credentials, but so does the EC2 metadata service, and so does any well-behaved client, since it wouldn't make sense to continually make requests to STS when the temporary credentials are still valid. I'm not convinced that the caching of the credentials, themselves, is the cause (though it contributes).

STS credentials are a black box, especially that "Session Token." Does it contain encrypted data? Or is it just a large random value, literally nothing more than a symbolic "token" with no intrinsic meaning? It doesn't really matter, but the point is that it isn't clearly documented.

IAM is a massive, distributed system, so it naturally does have "eventual consistency" issues that can arise on occasion.
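One pragmatic mitigation after editing a policy is to retry the failing call with backoff rather than assuming the change is instantly visible everywhere. A hedged sketch (the retry helper is mine, not part of boto3; in real code you would catch the SDK's ClientError for an authorization failure rather than a bare Exception):

```python
import time

def retry_with_backoff(call, attempts=5, base_delay=1.0):
    """Retry `call` with exponential backoff, to ride out
    eventually-consistent IAM policy propagation."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the real error
            time.sleep(base_delay * (2 ** attempt))

# usage sketch:
# retry_with_backoff(lambda: SNS.publish(TopicArn=arn, Message=msg))
```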

But, somehow, creating a new Lambda container -- which necessarily must make a new call to STS -- seems to have a cache-busting side effect: after a new deployment, the current IAM policies for the execution role become visible in cases where they previously were not.

This seems to pop up on occasion after you try a given action, it fails, and you realize you need to edit the IAM policy that you intended to be sufficient to allow the action. So you edit the policy, but subsequent attempts still fail; the policy seems valid, so you can't figure out why it isn't working, and you toss your hands in the air, send a message to John Rotenstein that you are at your wits' end... then, after giving up for the night in despair, you return the next morning to troubleshoot further and discover it's suddenly working.

Presumably this is related to the old execution role temporary credentials being replaced, either due to the containers being pruned and replaced due to inactivity, or simply because temporary credentials have a finite lifespan... but it is unclear whether refreshing the STS tokens is actually a necessary part of the solution, or whether it simply clears a logjam inside IAM as a side effect.

Redeploying the Lambda function to test a code change would likely have the same effect, either way, as it did here.