
Automating CloudFormation Stack Drift Remediation with AWS EventBridge and Lambda

Problem

When we work in a production cloud environment, we generally don't create resources directly in the AWS console. Most of the time, resources are provisioned from Terraform or CloudFormation templates through a CI/CD pipeline, so the DevOps team ends up creating and maintaining many such templates. In some cases, however, team members go directly to the AWS console and modify resources, which creates drift between the CloudFormation stack and the underlying resources.

Detecting and remediating drifted stack resources can easily be accomplished in the AWS console, but as our stack or team grows, addressing these stack drifts can get tedious.

Approach

With AWS Lambda, we can develop a function that remediates stack drift by reverting drifted resources to their original configurations. To automate this process, we can then configure an EventBridge schedule that invokes the function periodically to detect and remediate drifted resources.

What we'll build

In this hands-on tutorial, we will develop end-to-end automation with EventBridge Scheduler and a Lambda function that accomplishes the detection and remediation process.

What we'll need

  1. CloudFormation Stack containing resources
  2. Lambda Function to remediate the drift
  3. Lambda Execution role
  4. EventBridge Scheduler for triggering the Lambda function every x minutes
  5. EventBridge rule to invoke Lambda Function

Architecture

Steps

1. First, let's create an S3 bucket and upload the CloudFormation template containing the definition for the security group.

2. Next, we need to create the stack from the CloudFormation template stored in the S3 bucket. Go to the CloudFormation service, click on Create stack → With new resources (standard), provide the S3 URL of the template, select the default VPC ID from the dropdown, leave all other options as default, and click on Create stack.

After some time, you should see the CloudFormation stack in the CREATE_COMPLETE state, and the Resources section will show that a security group has been created. Now go to the security group and check the inbound rules. You will see that they match the CloudFormation stack.
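The console steps in step 2 can also be scripted. The following is a minimal sketch using boto3 and is not part of the original walkthrough. The parameter key VpcId and the function names are assumptions; substitute the key that your template actually declares.

```python
def build_create_stack_args(stack_name, template_url, vpc_id):
    """Build keyword arguments for CloudFormation's CreateStack API."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Assumption: the template declares a parameter named "VpcId".
        "Parameters": [{"ParameterKey": "VpcId", "ParameterValue": vpc_id}],
    }


def create_stack_and_wait(cfn, stack_name, template_url, vpc_id):
    """Create the stack and block until it reaches CREATE_COMPLETE."""
    cfn.create_stack(**build_create_stack_args(stack_name, template_url, vpc_id))
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


# Usage (requires AWS credentials; pass your own stack name, template URL, and VPC ID):
#   import boto3
#   create_stack_and_wait(boto3.client("cloudformation"), stack_name, template_url, vpc_id)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.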

3. Now let's create drift by going to the security group again and clicking on Edit inbound rules. Change the source of the inbound rule to Anywhere-IPv4 and save the rules.

4. Go to the CloudFormation stack again and, under Stack actions, click on Detect drift. This initiates the drift detection process. To check the result, click on View drift results. You will see that the security group was modified, along with the differences between the stack and the actual security group.
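For reference, the Detect drift and View drift results actions correspond to three CloudFormation API calls. Here is a hedged sketch of how they might be chained with boto3; the function name, polling interval, and retry limit are illustrative choices, and pagination of the results is ignored for brevity.

```python
import time


def detect_stack_drift(cfn, stack_name, poll_seconds=5, max_polls=60):
    """Run stack-level drift detection and return the drifted resources."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Detection is asynchronous, so poll until it leaves the in-progress state.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("drift detection did not finish in time")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    # Keep only resources that differ from the template or were deleted.
    response = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return response["StackResourceDrifts"]


# Usage (requires AWS credentials):
#   import boto3
#   for drift in detect_stack_drift(boto3.client("cloudformation"), stack_name):
#       print(drift["LogicalResourceId"], drift["StackResourceDriftStatus"])
```

Each returned entry also carries ExpectedProperties and ActualProperties as JSON strings, which is the same comparison the console displays.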

5. To remediate the drift, go back to the security group, click on Edit inbound rules, change the source back to the previous CIDR, which was 10.0.0.0/20, and save the rules. Now return to the CloudFormation stack, click on Stack actions, and click on Detect drift. The drift status is now IN_SYNC, which means the resource matches the CloudFormation stack.

We did this manually by going to the resource and editing it, but the same can be automated using a Lambda function. In the rest of the tutorial, we will create a Lambda function that detects drift in the same way and restores the resource to its original state.

Before doing this with the Lambda function, let's download the code from the repository below.

GitHub

https://github.com/vishal2505/AWSDevOpsProjects/tree/main/Project-1

Detecting and remediating drift using Lambda

Let's create a Lambda function to perform the manual steps above. Go to the Lambda service → provide the function name as DetectDrift → select Python 3.9 as the runtime, and select Create a new role with basic Lambda permissions.

Now copy the code from the GitHub repository and paste it into the file index.py

Now configure a test event: provide the JSON with values for the variables STACK_NAME, RESOURCE_TYPE, and SECURITY_GROUP_NAME. Save this event and click on Test.
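As an illustration, a test event with the three keys might look like the output of the snippet below. Every value is a placeholder rather than something taken from the repository, and the missing_keys helper is hypothetical; it only shows one way to check the event before using it.

```python
import json

REQUIRED_KEYS = ("STACK_NAME", "RESOURCE_TYPE", "SECURITY_GROUP_NAME")

# Placeholder values: replace them with your own stack and resource names.
test_event = {
    "STACK_NAME": "my-security-group-stack",
    "RESOURCE_TYPE": "AWS::EC2::SecurityGroup",
    "SECURITY_GROUP_NAME": "MySecurityGroup",
}


def missing_keys(event):
    """Return the required keys that are absent or empty in the event."""
    return [key for key in REQUIRED_KEYS if not event.get(key)]


# Print the JSON to paste into the Lambda console's test event editor.
print(json.dumps(test_event, indent=2))
```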

We'll get an error because the Lambda function's role doesn't have permissions on the CloudFormation stack, so it cannot perform the DetectStackResourceDrift operation.

So let's add permissions to the existing role. Go to Configuration → Permissions, click on the role, and attach the AWS managed policy for CloudFormation read-only access (AWSCloudFormationReadOnlyAccess).

Invoke the test for the Lambda function again. You will see another error: the function is not authorized to perform ec2:DescribeSecurityGroupRules and ec2:AuthorizeSecurityGroupIngress. Add these permissions to the role as well.
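One way to grant the two EC2 actions is an inline policy on the execution role. The sketch below is an assumption about how that could be done with boto3; the policy name is hypothetical, and Resource is left as "*" for simplicity, which you may want to narrow in a real account.

```python
import json


def build_security_group_policy():
    """Policy document allowing the two EC2 actions named in the error."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeSecurityGroupRules",
                    "ec2:AuthorizeSecurityGroupIngress",
                ],
                "Resource": "*",
            }
        ],
    }


def attach_inline_policy(iam, role_name, policy_name="DriftRemediationEc2Access"):
    """Attach the policy inline; policy_name is a hypothetical name."""
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_security_group_policy()),
    )


# Usage (requires AWS credentials):
#   import boto3
#   attach_inline_policy(boto3.client("iam"), lambda_execution_role_name)
```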

Now, as a final test, go to the security group, edit the inbound rules, delete the inbound rule, and save.

Once you invoke the test again, you will see the detection status DETECTION_COMPLETE and the message "Restored SSH security group successfully" in the logs. Verify this in the security group console: the inbound rule has reappeared. With this, we can perform drift detection and remediation using the Lambda function.
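The repository code is not reproduced here. As a rough illustration of the kind of logic such a function could contain, the sketch below detects drift on a single resource and re-adds any expected inbound rule that is missing. It rests on several assumptions: SECURITY_GROUP_NAME holds the logical resource ID from the template, the template defines its rules inline under SecurityGroupIngress with CidrIp, FromPort, and ToPort, and the physical resource ID is the security group ID. It does not revoke extra rules, which would additionally need ec2:RevokeSecurityGroupIngress.

```python
import json


def to_ip_permission(rule):
    """Convert a template-style ingress rule to the EC2 IpPermissions shape."""
    return {
        "IpProtocol": str(rule["IpProtocol"]),
        "FromPort": int(rule["FromPort"]),
        "ToPort": int(rule["ToPort"]),
        "IpRanges": [{"CidrIp": rule["CidrIp"]}],
    }


def find_missing_rules(expected_rules, live_rules):
    """Return expected ingress rules that are absent from the live group."""
    live = {
        (r.get("IpProtocol"), r.get("FromPort"), r.get("ToPort"), r.get("CidrIpv4"))
        for r in live_rules
        if not r.get("IsEgress")
    }
    return [
        rule
        for rule in expected_rules
        if (str(rule["IpProtocol"]), int(rule["FromPort"]), int(rule["ToPort"]), rule["CidrIp"])
        not in live
    ]


def lambda_handler(event, context, cfn=None, ec2=None):
    # Clients can be injected for testing; Lambda itself passes only event and context.
    if cfn is None or ec2 is None:
        import boto3
        cfn = cfn or boto3.client("cloudformation")
        ec2 = ec2 or boto3.client("ec2")

    # Assumption: SECURITY_GROUP_NAME is the logical resource ID in the template.
    drift = cfn.detect_stack_resource_drift(
        StackName=event["STACK_NAME"],
        LogicalResourceId=event["SECURITY_GROUP_NAME"],
    )["StackResourceDrift"]

    if drift.get("ResourceType") != event["RESOURCE_TYPE"]:
        return {"status": "SKIPPED", "restored": 0}
    status = drift["StackResourceDriftStatus"]
    if status != "MODIFIED":
        return {"status": status, "restored": 0}

    expected = json.loads(drift["ExpectedProperties"]).get("SecurityGroupIngress", [])
    group_id = drift["PhysicalResourceId"]
    live = ec2.describe_security_group_rules(
        Filters=[{"Name": "group-id", "Values": [group_id]}]
    )["SecurityGroupRules"]

    missing = find_missing_rules(expected, live)
    if missing:
        ec2.authorize_security_group_ingress(
            GroupId=group_id,
            IpPermissions=[to_ip_permission(rule) for rule in missing],
        )
        print(f"Re-added {len(missing)} inbound rule(s) to {group_id}")
    return {"status": status, "restored": len(missing)}
```

Because the expected rules come from the drift result itself, this variant does not hard-code the 10.0.0.0/20 CIDR; whether the repository takes the same approach is not something this sketch claims.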

Automating with AWS EventBridge Scheduler

The last step of this tutorial is to automate this using AWS EventBridge Scheduler. Let's go to the EventBridge Scheduler page and create a schedule.

Click on Create Schedule → Provide schedule name and description.

Select Recurring schedule and then Rate-based schedule. Provide the rate as 2 minutes. You can use any value, but to test quickly, let's use 2 minutes. Now click on Next.

On the Select target page, select AWS Lambda Invoke. Then, under the Invoke section, select the Lambda function and provide the payload with the values for STACK_NAME, RESOURCE_TYPE, and SECURITY_GROUP_NAME.

Click on Next again and configure permissions. Select Create new role for this schedule. You don't have to create the role explicitly; it is created automatically with the schedule.

Click on Next and then on Create schedule. We'll see that the DetectDriftEvery2Minutes schedule has been created.
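The same schedule could be created programmatically. This is a sketch under assumptions: the ARNs and payload are supplied by you, and unlike the console flow, the API does not create the role for you, so role_arn must point to an existing role that EventBridge Scheduler can assume to invoke the function.

```python
import json


def build_schedule_args(name, lambda_arn, role_arn, payload, rate="rate(2 minutes)"):
    """Build keyword arguments for EventBridge Scheduler's CreateSchedule API."""
    return {
        "Name": name,
        "ScheduleExpression": rate,
        # Run at the scheduled time rather than within a flexible window.
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": lambda_arn,
            "RoleArn": role_arn,
            # The payload is delivered to the Lambda function as its event.
            "Input": json.dumps(payload),
        },
    }


def create_drift_schedule(scheduler, lambda_arn, role_arn, payload):
    """Create the schedule under the name used in this tutorial."""
    args = build_schedule_args("DetectDriftEvery2Minutes", lambda_arn, role_arn, payload)
    return scheduler.create_schedule(**args)


# Usage (requires AWS credentials):
#   import boto3
#   create_drift_schedule(boto3.client("scheduler"), lambda_arn, role_arn, payload)
```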

Finally, let's wait for 2 minutes and observe the security group and the logs in CloudWatch.

In the logs, we see that the Lambda function is triggered every two minutes and restores the SSH security group whenever there is drift.

Teardown Resources

The last step is to delete all the resources we have created so that we don't incur any charges. Let's first delete the EventBridge schedule.

Then delete the Lambda function and the two roles that were created, delete the CloudFormation stack, and delete the S3 bucket.

Summary

In this tutorial, we've successfully accomplished the following tasks:

  • Created a security group using AWS CloudFormation
  • Detected and remediated unmanaged updates using CloudFormation drift detection
  • Configured an AWS Lambda function to act on our behalf to detect and remediate drifted settings on a security group
  • Automated the entire workflow to run on a schedule of every 2 minutes using an Amazon EventBridge schedule

We have also learned, at a high level, about AWS Lambda and AWS EventBridge and how to integrate them.

I'll be creating more AWS DevOps projects. Please follow to get an update.



