Many organizations use AWS as their cloud infrastructure, and they typically have multiple AWS accounts for production, staging, and development. Inevitably, you lose track of the experimental AWS resources your developers spin up. Eventually, you end up paying AWS bills for resources that could have been identified and deleted in time.
So, how should you go about identifying and deleting such unnecessary resources? aws-nuke to the rescue.
aws-nuke is a tool that removes all resources from an AWS account.
At Sumo Logic, we have recently created an internal Jenkins wizard on top of aws-nuke to help our developers clean up their AWS accounts. We were able to identify and delete long-forgotten resources, some of which were instantiated years back for various POCs.
The aws-nuke README does a great job of explaining how to use the tool, so we won’t reinvent the wheel. Instead, we are sharing some of the issues we faced and what we learned from using this tool for over two months.
1. Develop an org-wide standard nuke config that allows customization
aws-nuke uses a YAML configuration file to target and filter resources. Given how destructive this tool is, no developer in their right mind would want to experiment with it on their own. Therefore, developing a standard configuration for your entire organization is a must.
We cannot emphasize the importance of this step enough. Run, analyze, and verify your standard configuration a thousand times (and a thousand times more) before making it available to your developers.
Some developers will always have a genuine need to keep resources that the standard configuration would delete. Provide a way for them to override the standard configuration.
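One way to allow customization without weakening the baseline is to let developers only add filters on top of the standard configuration. The following is a minimal sketch of that idea, not our actual wizard code. The nesting (accounts, then filters, then resource type) follows the layout described in the aws-nuke README; the account ID, the filter values, and the function name `apply_overrides` are hypothetical, and a real configuration needs more than is shown here (for example, the list of accounts that must never be nuked).

```python
import copy

# Hypothetical org-wide baseline. Account ID and filter values are made up.
STANDARD_CONFIG = {
    "regions": ["global", "us-east-1"],
    "accounts": {
        "000000000000": {
            "filters": {
                "IAMRole": ["OrganizationAccountAccessRole"],
            }
        }
    },
}


def apply_overrides(standard, account_id, extra_filters):
    """Return a copy of `standard` with a developer's extra filters added.

    Overrides can only add filters (protect more resources); they cannot
    remove anything the standard configuration already protects.
    """
    merged = copy.deepcopy(standard)
    account = merged["accounts"].setdefault(account_id, {})
    filters = account.setdefault("filters", {})
    for resource_type, entries in extra_filters.items():
        existing = filters.setdefault(resource_type, [])
        for entry in entries:
            if entry not in existing:  # avoid duplicate filter entries
                existing.append(entry)
    return merged


# Example: a developer asks to keep one bucket (illustrative identifier).
merged = apply_overrides(
    STANDARD_CONFIG, "000000000000", {"S3Bucket": ["s3://keep-this-bucket"]}
)
```

The merged dictionary can then be written out as YAML with whatever YAML library you already use. Because the baseline is deep-copied, one developer's override never leaks into the next run.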
2. Provide a dry-run option
aws-nuke’s default mode is non-destructive: it only prints the AWS resources that would be deleted if you ran the tool with the --no-dry-run flag. Your developers need access to this default mode so they can spot resources they want to keep and exclude them from deletion.
Although optional, parsing and formatting the report generated by aws-nuke can make it more readable.
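As an illustration of such post-processing, the sketch below condenses a dry-run report into a per-resource-type count. It assumes each report line has the shape `region - ResourceType - identifier - ... - would remove`; check this against the output of your aws-nuke version before relying on it. The function names and the sample lines are made up for the example.

```python
from collections import Counter


def summarize_dry_run(lines, marker="would remove"):
    """Count resources per type among the lines that end with `marker`."""
    counts = Counter()
    for line in lines:
        parts = [p.strip() for p in line.split(" - ")]
        # Expect at least: region, resource type, identifier, status.
        if len(parts) < 4 or parts[-1] != marker:
            continue
        counts[parts[1]] += 1
    return counts


def format_summary(counts):
    """Render the counts as a small table, largest groups first."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{n:>8}  {rtype}" for rtype, n in rows)


# Illustrative input, not real aws-nuke output.
sample = [
    "us-east-1 - S3Bucket - s3://demo - would remove",
    "us-east-1 - S3Object - s3://demo/a.txt - would remove",
    "us-east-1 - S3Object - s3://demo/b.txt - would remove",
    "global - IAMUser - nuke-temp - filtered by config",
]
print(format_summary(summarize_dry_run(sample)))
```

A summary like this makes it easy to see at a glance which resource types dominate a run, which is also useful input for the S3 and DynamoDB discussion further down.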
3. Authenticating aws-nuke using AWS credentials
This will depend on how your organization has set up AWS. After trying various approaches, we finally settled on using static credentials:
- For each run of the tool, create a new temporary IAM user in the account which is being nuked.
- Give this temp-IAM-user full administrative access to the account.
- Ensure your nuke config filters this temp-IAM-user from deletion.
- Generate accessKeyId and secretAccessKey for this temp-IAM-user.
- Pass these static credentials to the aws-nuke tool.
- Delete this temp-IAM-user after the aws-nuke run finishes.
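The steps above could be sketched roughly as follows. This is an illustration under assumptions, not our Jenkins implementation: `iam` is expected to behave like `boto3.client("iam")`, the use of the AWS-managed AdministratorAccess policy is our choice for the example, and the helper names are hypothetical. The `--access-key-id` and `--secret-access-key` flags are the static-credential options described in the aws-nuke README.

```python
import contextlib

# AWS-managed policy used here to grant full access (an assumption of this sketch).
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


@contextlib.contextmanager
def temporary_admin_user(iam, user_name):
    """Create a short-lived admin IAM user and always clean it up afterwards.

    `iam` is expected to behave like boto3.client("iam").
    """
    iam.create_user(UserName=user_name)
    key_id = None
    attached = False
    try:
        iam.attach_user_policy(UserName=user_name, PolicyArn=ADMIN_POLICY_ARN)
        attached = True
        key = iam.create_access_key(UserName=user_name)["AccessKey"]
        key_id = key["AccessKeyId"]
        yield key["AccessKeyId"], key["SecretAccessKey"]
    finally:
        # IAM refuses to delete a user that still has keys or attached policies.
        if key_id is not None:
            iam.delete_access_key(UserName=user_name, AccessKeyId=key_id)
        if attached:
            iam.detach_user_policy(UserName=user_name, PolicyArn=ADMIN_POLICY_ARN)
        iam.delete_user(UserName=user_name)


def nuke_command(config_path, access_key_id, secret_access_key, destructive=False):
    """Build the aws-nuke command line; dry-run unless `destructive` is set."""
    cmd = [
        "aws-nuke",
        "--config", config_path,
        "--access-key-id", access_key_id,
        "--secret-access-key", secret_access_key,
    ]
    if destructive:
        cmd.append("--no-dry-run")
    return cmd


# Usage (requires boto3 and credentials allowed to manage IAM users):
#   import boto3
#   with temporary_admin_user(boto3.client("iam"), "nuke-temp") as (kid, secret):
#       cmd = nuke_command("nuke-config.yml", kid, secret)
```

Remember that the nuke config must filter the temporary user, its access key, and its policy attachment, otherwise aws-nuke will try to delete the very credentials it is running with.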
4. S3 objects and DynamoDB items create pain points
Our key takeaway from multiple runs of the tool by various engineers is that the more S3Objects and DynamoDB items there are to delete, the slower aws-nuke progresses. This is because aws-nuke tracks the deletion of every resource. It keeps tracking S3 objects and DynamoDB items even when the corresponding S3 bucket or DynamoDB table is deleted first. We came up with two approaches to overcome this issue.
a) Target only S3 and DynamoDB via aws-nuke config
We added a configuration that targets deleting only S3 and DynamoDB resources. If developers find their aws-nuke run to be slower than expected, they can first clean up these two resource categories. This works very efficiently with time-restricted runs which are discussed in the next section.
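A targeted configuration of this kind can be derived from the standard one by adding a `resource-types` section with a `targets` list, as described in the aws-nuke README. The sketch below shows the idea; the helper name `storage_only` is hypothetical, and the resource type names should be checked against the list your aws-nuke version prints with `aws-nuke resource-types`.

```python
import copy

# Resource type names as we understand aws-nuke reports them; verify them
# against your aws-nuke version before use.
STORAGE_TARGETS = ["S3Object", "S3Bucket", "DynamoDBTableItem", "DynamoDBTable"]


def storage_only(config, targets=STORAGE_TARGETS):
    """Return a copy of `config` restricted to the given resource types.

    Filters in the original config are kept, so protected buckets and
    tables stay protected in the narrowed run.
    """
    narrowed = copy.deepcopy(config)
    narrowed["resource-types"] = {"targets": list(targets)}
    return narrowed


# Example with a made-up baseline:
base = {"regions": ["us-east-1"], "accounts": {"000000000000": {"filters": {}}}}
s3_and_dynamo = storage_only(base)
```

Keeping the narrowed config derived from the standard one, rather than maintained by hand, means the two cannot drift apart.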
b) Use S3 lifecycle policies
You can also offload the S3 cleanup to AWS itself by using appropriate lifecycle policies.
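As a hedged example of what such a policy might look like, the sketch below builds a lifecycle configuration that expires every object in a bucket after a given number of days, including noncurrent versions and incomplete multipart uploads. The dictionary shape is the one accepted by the S3 `put_bucket_lifecycle_configuration` call in boto3; the rule ID, the one-day default, and the function names are choices made for this illustration.

```python
def expire_everything_rule(days=1):
    """Build a lifecycle configuration that expires all objects after `days`."""
    return {
        "Rules": [
            {
                "ID": "expire-all-objects",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # empty prefix matches every key
                "Expiration": {"Days": days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


def apply_lifecycle(s3, bucket, days=1):
    """Attach the rule to `bucket`; `s3` should behave like boto3.client("s3")."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expire_everything_rule(days),
    )


# Usage (requires boto3 and suitable permissions):
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "my-experimental-bucket")
```

Note that S3 applies lifecycle rules asynchronously, so objects do not disappear the moment the rule is set. Plan the aws-nuke run for after the expiration has had time to take effect.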
5. Tackle the memory consumption issue with a ‘fire and forget’ model
We found that aws-nuke runs out of memory in two situations: when the account being nuked has a lot of resources (over a million in total), and when multiple aws-nuke processes run on the same machine (how many it tolerates depends on the machine configuration). We used two strategies to handle this.
- Run aws-nuke jobs with no concurrency, i.e. only a single job at a time.
- Restrict each run with a standard timeout to avoid a significant pile up in the job’s queue.
Thus, we have provided a fire-and-forget model to our developers via our Jenkins wizard. Each job times out if it cannot finish in an hour. We then email the developer their job’s results, along with instructions for re-triggering the job if required.
PS: aws-nuke’s memory consumption bug is reported on GitHub.
6. Reduce aws-nuke logging as a last resort
aws-nuke logging is verbose for an obvious reason: you must know everything that will be irrevocably deleted.
In our case, we were nuking both S3 Buckets and S3Objects. Here, S3 Bucket deletion takes care of deleting its S3Objects as well. As a result, aws-nuke spends a lot of time doing nothing but logging “Could not delete S3Object” for every such S3Object. If required, as a last resort, you may reduce the logging as suggested in this issue on GitHub.
aws-nuke is a potentially destructive yet very useful tool. Your ROI from using this tool will depend on the current state of your AWS accounts.