2
votes

I have created AWS infrastructure (EC2, Redshift, VPC, etc.) via CloudFormation. Now I want to delete it in a particular reverse order. For example, all resources depend on the VPC, so the VPC should be deleted last. Every other stack deletes, but the VPC stack fails to delete via Python boto3. It shows a subnet or network interface dependency error. But when I delete it via the console, it succeeds. Has anyone faced this issue?

I have tried deleting everything attached to it, such as the load balancer, but the VPC still does not delete.

2
The AWS console deletes dependencies for you. Could you explain what you mean by "I have tried to delete everyting like loadbalancer which is attached to it. However It is not deleting"? Is it the load balancer or the VPC that is not deleting? - gp42
I think you probably have some VM whose NIC is in a subnet within that VPC. Try finding out which NICs are in that VPC. A CloudFormation delete usually takes care of dependency tracking for you, so it's probably something outside your CloudFormation stack. - rdas
@gp42 I have updated the question. The VPC is not deleting. - ImPurshu
Do you have a Lambda function that runs within the VPC? If so, I can provide a Lambda that cleans up all associated network interfaces. I had a similar issue and created that Lambda, which is triggered by a custom resource. - Biplob Biswas
Yes @Biplob, I have a Lambda that runs within the VPC. Can you share that script with me? Maybe I can try it. - ImPurshu

2 Answers

4
votes

AWS CloudFormation creates a dependency graph between resources based upon DependsOn references in the template and references between resources.

It then tries to deploy resources in parallel, but takes dependencies into account.

For example, a Subnet might be defined as:

Subnet1:
    Type: AWS::EC2::Subnet
    Properties:
      CidrBlock: 10.0.0.0/24
      VpcId: !Ref ProdVPC

In this situation, there is an explicit reference to ProdVPC, so CloudFormation will only create Subnet1 after ProdVPC has been created.

When a CloudFormation stack is deleted, the reverse logic is applied. In this case, Subnet1 will be deleted before ProdVPC is deleted.
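
To make the ordering concrete, here is a minimal sketch of the idea using Python's standard `graphlib` module. It is only an illustration of dependency ordering, not how CloudFormation is implemented internally; `DEPENDS_ON` and `Instance1` are hypothetical names added for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical resource graph: each key depends on the resources in its set.
DEPENDS_ON = {
    "ProdVPC": set(),
    "Subnet1": {"ProdVPC"},
    "Instance1": {"Subnet1"},
}


def creation_order(depends_on):
    """Return resources so that every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def deletion_order(depends_on):
    """Deletion walks the same order backwards: dependents go first."""
    return list(reversed(creation_order(depends_on)))


print(creation_order(DEPENDS_ON))
print(deletion_order(DEPENDS_ON))
```

In this sketch the VPC comes first on creation and last on deletion, which is the ordering described above.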

However, CloudFormation is not aware of resources created outside of the stack. This means that if a resource (e.g. an Amazon EC2 instance) is created inside the Subnet, stack deletion will fail because the Subnet cannot be deleted while an EC2 instance is using it (or, more accurately, while an ENI still exists in it).

In such situations, you will need to manually delete the resources that are causing the "delete failure" and then try the delete command again.
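
Since the question uses boto3, here is a rough sketch of how you might delete a stack and see which resources blocked the deletion. The function name `delete_stack_and_report` is made up for this example, and `cfn` is assumed to be a boto3 CloudFormation client (for example `boto3.client("cloudformation")`).

```python
def delete_stack_and_report(cfn, stack_name):
    """Delete a stack; return [] on success, else the DELETE_FAILED resources.

    `cfn` is assumed to be a boto3 CloudFormation client.
    """
    # Resolve the stack ID first: once a stack is deleted it can no longer
    # be looked up by name, only by ID.
    stack_id = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackId"]
    cfn.delete_stack(StackName=stack_id)
    try:
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack_id)
        return []
    except Exception:
        # With boto3 this is typically a botocore WaiterError, raised when the
        # stack ends up in DELETE_FAILED or the waiter times out.
        # Only the first page of events (most recent first) is inspected here.
        events = cfn.describe_stack_events(StackName=stack_id)["StackEvents"]
        return [
            (event["LogicalResourceId"], event.get("ResourceStatusReason", ""))
            for event in events
            if event["ResourceStatus"] == "DELETE_FAILED"
        ]
```

The returned reasons usually name the dependency (a subnet or network interface) that has to be cleaned up before you retry the delete.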

A good way to find such resources is to look in the Network Interfaces section of the EC2 management console. Make sure that there are no interfaces connected to the VPC.
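
The same check can be done from boto3. This is a small sketch, assuming `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`); the function name `list_vpc_interfaces` is my own.

```python
def list_vpc_interfaces(ec2, vpc_id):
    """Return (interface id, status, description) for every ENI in the VPC.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    paginator = ec2.get_paginator("describe_network_interfaces")
    pages = paginator.paginate(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return [
        (eni["NetworkInterfaceId"], eni["Status"], eni.get("Description", ""))
        for page in pages
        for eni in page["NetworkInterfaces"]
    ]
```

The description column usually hints at what owns each interface (a load balancer, a Lambda function, and so on), which tells you what to clean up before deleting the VPC stack.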

0
votes

Since you mentioned that the VPC fails to delete in stacks containing Lambda functions that run inside the VPC, the most likely cause is the network interfaces that Lambda creates to connect to other resources in the VPC.

In principle, these network interfaces should be deleted automatically when the Lambda functions are removed from the stack, but in my experience orphaned ENIs can remain and prevent the VPC from being deleted.

For this reason, I created a Lambda-backed custom resource that cleans up the ENIs after all Lambda functions in the VPC have been removed.

This is the CloudFormation part, where you set up the custom resource and pass the VPC ID:

##############################################
#                                            #
#  Custom resource deleting net interfaces   #
#                                            #
##############################################

  NetInterfacesCleanupFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src
      Handler: cleanup/network_interfaces.handler
      Role: !GetAtt BasicLambdaRole.Arn
      DeploymentPreference:
        Type: AllAtOnce
      Timeout: 900

  PermissionForNewInterfacesCleanupLambda:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName:
        Fn::GetAtt: [ NetInterfacesCleanupFunction, Arn ]
      Principal: lambda.amazonaws.com

  InvokeLambdaFunctionToCleanupNetInterfaces:
    DependsOn: [PermissionForNewInterfacesCleanupLambda]
    Type: Custom::CleanupNetInterfacesLambda
    Properties:
      ServiceToken: !GetAtt NetInterfacesCleanupFunction.Arn
      StackName: !Ref AWS::StackName
      VPCID:
        Fn::ImportValue: !Sub '${MasterStack}-Articles-VPC-Ref'
      Tags:
        'owner': !Ref StackOwner
        'task': !Ref Task

And this is the corresponding Lambda. It retries a limited number of times (MAX_RETRIES) to detach and delete orphaned network interfaces and fails if it can't, which means a Lambda function is still generating new network interfaces and you need to debug that.

import boto3
from botocore.exceptions import ClientError
from time import sleep

# Fix this wherever your custom resource handler code is
from common import cfn_custom_resources as csr
import sys

MAX_RETRIES = 3
client = boto3.client('ec2')


def handler(event, context):

    vpc_id = event['ResourceProperties']['VPCID']

    if not csr.__is_valid_event(event, context):
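        # Report the failure so CloudFormation does not wait for a response
        result = {'result': 'Invalid custom resource event'}
        csr.send(event, context, csr.FAILED, csr.validate_response_data(result))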
        return
    elif event['RequestType'] == 'Create' or event['RequestType'] == 'Update':
        result = {'result': 'Don\'t trigger the rest of the code'}
        csr.send(event, context, csr.SUCCESS, csr.validate_response_data(result))
        return
    try:
        # Get all network interfaces in the given VPC that were created by Lambda
        interfaces = client.describe_network_interfaces(
            Filters=[
                {
                    'Name': 'description',
                    'Values': ['AWS Lambda VPC ENI*']
                },
                {
                    'Name': 'vpc-id',
                    'Values': [vpc_id]
                },
            ],
        )

        failed_detach = list()
        failed_delete = list()

        # Detach the above found network interfaces
        for interface in interfaces['NetworkInterfaces']:
            detach_interface(failed_detach, interface)

        # Try detach a second time and delete each simultaneously
        for interface in interfaces['NetworkInterfaces']:
            detach_and_delete_interface(failed_detach, failed_delete, interface)

        # Report success only if nothing failed to detach and nothing failed to delete
        if not failed_detach and not failed_delete:
            result = {'result': 'Network interfaces detached and deleted successfully'}
            csr.send(event, context, csr.SUCCESS, csr.validate_response_data(result))
        else:
            result = {'result': 'Network interfaces couldn\'t be deleted completely'}
            csr.send(event, context, csr.FAILED, csr.validate_response_data(result))
            # print(response)
    except Exception:
        print("Unexpected error:", sys.exc_info())
        result = {'result': 'Some error with the process of detaching and deleting the network interfaces'}
        csr.send(event, context, csr.FAILED, csr.validate_response_data(result))


def detach_interface(failed_detach, interface):
    try:

        if interface['Status'] == 'in-use':
            detach_response = client.detach_network_interface(
                AttachmentId=interface['Attachment']['AttachmentId'],
                Force=True
            )

            # Sleep for 1 sec after every detachment
            sleep(1)

            print(f"Detach response for {interface['NetworkInterfaceId']}- {detach_response}")

            if 'HTTPStatusCode' not in detach_response['ResponseMetadata'] or \
                    detach_response['ResponseMetadata']['HTTPStatusCode'] != 200:
                failed_detach.append(detach_response)
    except ClientError as e:
        print(f"Exception details - {sys.exc_info()}")


def detach_and_delete_interface(failed_detach, failed_delete, interface, retries=0):

    detach_interface(failed_detach, interface)

    sleep(retries + 1)

    try:
        delete_response = client.delete_network_interface(
            NetworkInterfaceId=interface['NetworkInterfaceId'])

        print(f"Delete response for {interface['NetworkInterfaceId']}- {delete_response}")
        if 'HTTPStatusCode' not in delete_response['ResponseMetadata'] or \
                delete_response['ResponseMetadata']['HTTPStatusCode'] != 200:
            failed_delete.append(delete_response)
    except ClientError as e:
        print(f"Exception while deleting - {str(e)}")
        print()
        if retries < MAX_RETRIES:
            if e.response['Error']['Code'] == 'InvalidNetworkInterface.InUse' or \
                    e.response['Error']['Code'] == 'InvalidParameterValue':
                retries = retries + 1
                print(f"Retry {retries} : Interface in use, deletion failed, retrying to detach and delete")
                detach_and_delete_interface(failed_detach, failed_delete, interface, retries)
            else:
                raise RuntimeError("Code not found in error")
        else:
            raise RuntimeError("Max Number of retries exhausted to remove the interface")
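
The `csr` helper module imported above is not shown. As a rough sketch of what its `send` function might look like (this is an assumption, not the actual helper), a custom resource reports back by sending an HTTP PUT with a JSON body to the pre-signed `ResponseURL` in the event. The names `build_response` and `send` are hypothetical here.

```python
import json
import urllib.request


def build_response(event, context, status, data, reason=None):
    """Build the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream: {context.log_stream_name}",
        # Reuse the existing physical ID on Update/Delete; fall back on Create.
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    }


def send(event, context, status, data):
    """PUT the response to the pre-signed URL supplied in the event."""
    body = json.dumps(build_response(event, context, status, data)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # Assumption: an empty content type, as the pre-signed S3 URL expects.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

If the handler never sends a response, CloudFormation keeps waiting until the custom resource times out, so every code path should end in a `send` call.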

The link to the lambda is https://gist.github.com/revolutionisme/8ec785f8202f47da5517c295a28c7cb5

More information about configuring lambdas in a VPC - https://docs.aws.amazon.com/lambda/latest/dg/vpc.html