How to find out who did what in AWS

CloudTrail logs all AWS account activity, enabling detailed audits to identify issues. Learn how it helped resolve a production outage by pinpointing accidental changes.

A man dressed in a detective costume, holding a magnifying glass
Photo by Andres Siimon / Unsplash

Ever had an unexpected problem suddenly appear in your environment, seemingly out of nowhere? Maybe the problem has an extremely specific cause and can only be fixed by rolling back exactly the modifications that triggered it. Having IaC (e.g. Terraform) might help you recreate the infrastructure in some cases, but not all, and it doesn't necessarily give you enough information to prevent similar problems in the future. To address this, AWS offers CloudTrail, a service for logging, monitoring, and auditing account activity within your AWS infrastructure.

Enabling CloudTrail

CloudTrail is enabled by default from the moment you create your AWS account. From then on, it automatically captures and stores event logs for the account, providing a comprehensive audit trail.

CloudTrail's event history is free and covers the most recent 90 days of events, which you can access either through the console or through the CloudTrail API. The latter allows for custom automation, such as a Python script that detects anomalous activity. There are also paid options that integrate with other resources, such as S3 buckets for storing events over longer periods.
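As a rough illustration of the API route, here is a minimal sketch that pulls recent events by name. It assumes the `boto3` SDK, whose CloudTrail client exposes `lookup_events`; the helper name `fetch_recent_events` and the choice to filter on `EventName` are mine, not something CloudTrail prescribes.

```python
from datetime import datetime, timedelta, timezone


def fetch_recent_events(client, event_name, hours=24):
    """Return event-history entries named `event_name` from the last `hours` hours.

    `client` is expected to behave like boto3.client("cloudtrail").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)

    # lookup_events returns results in pages; the paginator follows NextToken for us.
    paginator = client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        StartTime=start,
        EndTime=end,
    )

    events = []
    for page in pages:
        events.extend(page.get("Events", []))
    return events
```

With credentials that allow `cloudtrail:LookupEvents`, calling `fetch_recent_events(boto3.client("cloudtrail"), "ConsoleLogin")` should return the console sign-ins from the last day. Passing the client in as a parameter keeps the function easy to exercise with a stub.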

Viewing Event History

Through the console, you can access the CloudTrail service and navigate to the "Event history" section. Here, you'll find a searchable, browsable record of the management events (the control-plane API calls made against your resources) recorded in the current Region over the last 90 days.

CloudTrail main screen, with "Event history" highlighted

The event history view allows you to filter events based on various criteria such as time range, resource type, and specific API actions. You can also view detailed information about each event, including the request parameters, the AWS service involved, and the identities of the principals (users or roles) that initiated the activity.

With CloudTrail's event history, you gain visibility into not just infrastructure-level changes, but also management operations like creating new users or policies within IAM. This level of traceability is invaluable for maintaining security, ensuring compliance, and conducting forensic investigations when needed.

For more information, check out the AWS docs.

Resolving a Production Outage with CloudTrail

Our team faced a difficult situation a few weeks ago when our mission-critical production environment encountered an issue, halting operations. We initially struggled to pinpoint the root cause of the problem. However, we had AWS CloudTrail enabled, which proved invaluable in resolving the incident.

CloudTrail records API calls made within your AWS account, capturing critical information such as the identity of the caller, the time of the event, the source IP address, and the request parameters. This also includes console users who interact with resources through the browser. By analyzing the detailed audit trail provided by CloudTrail, our team could identify who had inadvertently made a configuration change that led to a series of failures, and which actions that person had performed on the affected resources.
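To make those fields concrete, here is a small sketch that condenses one event into "who, when, from where, and what". It assumes the shape returned by the `lookup_events` API, where each entry carries the full record as a JSON string under `CloudTrailEvent`; the function name `summarize_event` and the output keys are illustrative choices.

```python
import json


def summarize_event(event):
    """Condense one lookup_events entry into who / when / where / what."""
    record = json.loads(event["CloudTrailEvent"])
    identity = record.get("userIdentity", {})
    return {
        # Prefer the full ARN; fall back to the user name or the identity type.
        "who": identity.get("arn") or identity.get("userName") or identity.get("type"),
        "when": record.get("eventTime"),
        "source_ip": record.get("sourceIPAddress"),
        "action": f'{record.get("eventSource")}:{record.get("eventName")}',
        "parameters": record.get("requestParameters"),
    }


# A made-up record for illustration only; real events contain many more fields.
sample = {
    "CloudTrailEvent": json.dumps({
        "eventTime": "2024-01-01T12:00:00Z",
        "eventSource": "iam.amazonaws.com",
        "eventName": "CreateUser",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/alice"},
        "requestParameters": {"userName": "new-user"},
    })
}
print(summarize_event(sample))
```

Running the summary over every event in a time window gives a compact timeline that is much easier to scan than the raw JSON.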

This information allowed us to undo the configuration that was causing the problems in the production environment, and also gave us the opportunity to talk to the person who had made the mistake, in order to learn from it and prevent it from happening again.

If you're curious, the problem was a misconfigured VPC. A resource lost access to the internet when its private subnet's outgoing gateway was switched from a NAT Gateway to an Internet Gateway. An Internet Gateway only works for resources that have a public IP address, which resources in a private subnet do not have. From the AWS docs:

To provide your instances with internet access without assigning them public IP addresses, use a NAT device instead. A NAT device enables instances in a private subnet to connect to the internet, but prevents hosts on the internet from initiating connections with the instances.
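A change like this one can be searched for in the event history. The sketch below scans already-fetched events for route changes whose target is an Internet Gateway. It is not the exact query we ran: the event names `CreateRoute` and `ReplaceRoute` are the EC2 API actions for route changes, while the request parameter names `routeTableId` and `gatewayId` are assumptions about how CloudTrail records them and should be checked against a real event.

```python
import json

# EC2 API actions that can point a route at a new target.
ROUTE_EVENTS = {"CreateRoute", "ReplaceRoute"}


def routes_pointed_at_igw(events):
    """Return route changes whose target is an Internet Gateway (id starts with 'igw-')."""
    findings = []
    for event in events:
        if event.get("EventName") not in ROUTE_EVENTS:
            continue
        record = json.loads(event["CloudTrailEvent"])
        # Assumed field names; requestParameters may also be null on some events.
        params = record.get("requestParameters") or {}
        gateway_id = params.get("gatewayId") or ""
        if gateway_id.startswith("igw-"):
            findings.append({
                "when": record.get("eventTime"),
                "who": record.get("userIdentity", {}).get("arn"),
                "route_table": params.get("routeTableId"),
                "gateway": gateway_id,
            })
    return findings
```

Each finding still needs a human check, because pointing a public subnet's route table at an Internet Gateway is perfectly normal; only the route tables attached to private subnets are suspicious.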

The AWS VPC Reachability Analyzer also played an important role in identifying the problem, but we could only be 100% sure of what happened through CloudTrail.

Without it, identifying the source of the issue and determining accountability would have been significantly more challenging and time-consuming (dare I say impossible, without depending on the person owning up to their honest mistake). The contextual information provided by CloudTrail enabled us to recreate the exact sequence of events that led to the outage, allowing us to develop an effective recovery plan and restore our production environment promptly.