Many AWS customers run their workloads across multiple AWS accounts, so they want to run chaos experiments across accounts to understand how a workload behaves during a cascading or correlated failure. Today, AWS Fault Injection Simulator (FIS) does not yet support targets in different accounts, but that does not prevent us from running such experiments via AWS Step Functions, which integrates well with AWS Fault Injection Simulator.
AWS Step Functions allows us to create states with the following actions:
If you are interested in having a central place where you create experiments and fan them out to the various accounts via AWS Service Catalog, you can read up on it here. For this blog post, I’ve created an experiment that only executes a chaos-mesh experiment via FIS in account A and reboots an EC2 instance via FIS in account B. Point the execution steps to your own FIS experiment templates.
Please keep in mind that when running chaos experiments in your environment, you should follow the following workflow before executing the experiment. Because the goal of this blog post is to show you how to build a Step Functions state machine that runs experiments across accounts, I will skip much of this workflow and focus only on executing FIS via Step Functions.
For our workload, which consists of an EKS cluster in account A and a database on EC2 in account B, I will build the following state machine:
Before we can start, you need to create an IAM role in account B, the FIS execution role, that account A will be allowed to assume. Give it the following permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FisExecutionRole",
      "Effect": "Allow",
      "Action": [
        "fis:StartExperiment",
        "fis:TagResource"
      ],
      "Resource": "*"
    }
  ]
}
You will also have to define a trust policy for this role so that account A is authorized to assume the role in account B. Replace YourAccountID with the ID of account A:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YourAccountID:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
Note the ARN of the role in account B, as you will need it in the Step Functions step in account A!
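If you prefer to generate the two policy documents for the role in account B rather than paste them by hand, the following is a minimal sketch of how that could look. The function names and the placeholder account ID are illustrative assumptions, not part of the original setup; the policy contents mirror the JSON shown above.

```python
import json
import re


def fis_permissions_policy() -> dict:
    """Permissions policy for the role in account B (same statement as above)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "FisExecutionRole",
                "Effect": "Allow",
                "Action": ["fis:StartExperiment", "fis:TagResource"],
                "Resource": "*",
            }
        ],
    }


def trust_policy(trusted_account_id: str) -> dict:
    """Trust policy that lets the trusted account (account A) assume the role."""
    # AWS account IDs are 12-digit numbers; reject unreplaced placeholders early.
    if not re.fullmatch(r"[0-9]{12}", trusted_account_id):
        raise ValueError("expected a 12-digit AWS account ID")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {},
            }
        ],
    }


if __name__ == "__main__":
    # 111111111111 is a hypothetical placeholder for the ID of account A.
    print(json.dumps(fis_permissions_policy(), indent=2))
    print(json.dumps(trust_policy("111111111111"), indent=2))
```

The printed documents can be pasted into the IAM console when you create the role in account B.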
Let’s create a state machine in account A and click next.
In the search field at the top left, enter FIS and drag the StartExperiment action into your state machine workflow. Rename it as you like. You should see something like this:
Click twice and notice the banner at the bottom of the page! We will add the permissions once the role is created.
Give the StateMachine a name and click
This will take you to the following page. Click Edit role in IAM to add the missing permissions.
Keep in mind that our role also needs AssumeRole permissions for the cross-account access. We are therefore adding the following permissions.
Allow the role to assume roles on all resources:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAssumeRole",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "*"
    }
  ]
}
Also allow it to execute the experiment:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FisExecutionRole",
      "Effect": "Allow",
      "Action": [
        "fis:StartExperiment",
        "fis:TagResource"
      ],
      "Resource": "*"
    }
  ]
}
Now go back to your state machine and add a second step as follows:
Make sure that under "IAM role for cross-account access – optional" you choose "Provide IAM role ARN" and enter the ARN of the role in account B that you created earlier, for example arn:aws:iam::AccountNumberB:role/fisfullaccess
Apply the changes. You are now ready to execute the state machine.
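For readers who would rather define the state machine in code than in the visual editor, here is a rough sketch of what the resulting two-step definition might look like in Amazon States Language, built as a Python dictionary. The state names, tag values, and template IDs are hypothetical placeholders; the sketch assumes the AWS SDK integration for FIS StartExperiment and the `Credentials` field that Step Functions uses for cross-account role assumption.

```python
import json

# AWS SDK service integration for the FIS StartExperiment API call.
FIS_START_EXPERIMENT = "arn:aws:states:::aws-sdk:fis:startExperiment"


def start_experiment_state(template_id, name_tag, next_state=None, role_arn=None):
    """Build one Task state that starts an FIS experiment from a template."""
    state = {
        "Type": "Task",
        "Resource": FIS_START_EXPERIMENT,
        "Parameters": {
            # Idempotency token generated at run time by an intrinsic function.
            "ClientToken.$": "States.UUID()",
            "ExperimentTemplateId": template_id,
            "Tags": {"Name": name_tag},
        },
    }
    if role_arn is not None:
        # Cross-account step: Step Functions assumes this role before the call.
        state["Credentials"] = {"RoleArn": role_arn}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition(template_a, template_b, role_arn_b):
    """Two sequential steps: account A first, then account B via the assumed role."""
    return {
        "Comment": "Cross-account FIS experiments (illustrative sketch)",
        "StartAt": "ExperimentAccountA",
        "States": {
            "ExperimentAccountA": start_experiment_state(
                template_a, "experiment-account-a", next_state="ExperimentAccountB"
            ),
            "ExperimentAccountB": start_experiment_state(
                template_b, "experiment-account-b", role_arn=role_arn_b
            ),
        },
    }


if __name__ == "__main__":
    # Template IDs and the account number below are placeholders; use your own.
    definition = build_definition(
        "EXT_TEMPLATE_ID_A",
        "EXT_TEMPLATE_ID_B",
        "arn:aws:iam::222222222222:role/fisfullaccess",
    )
    print(json.dumps(definition, indent=2))
```

Only the second state carries a `Credentials` block, which matches the console setup: the first step runs with the state machine's own role in account A, the second with the role in account B.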
You should see both states turn green.
You can now verify in the FIS console of both accounts that both experiments were executed with the tag names defined in your Step Functions steps!
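The same check can be scripted instead of done in the console. The sketch below assumes boto3 is installed and that you run it once per account with credentials for that account; the helper name and the tag value are illustrative assumptions. It pages through `list_experiments` and keeps the experiments carrying a given tag.

```python
def find_experiments_by_tag(fis_client, tag_key, tag_value):
    """Return id and status of experiments whose tags contain tag_key=tag_value."""
    matches = []
    kwargs = {}
    while True:
        page = fis_client.list_experiments(**kwargs)
        for exp in page.get("experiments", []):
            if exp.get("tags", {}).get(tag_key) == tag_value:
                matches.append({"id": exp["id"], "status": exp["state"]["status"]})
        token = page.get("nextToken")
        if not token:
            return matches
        # Request the next page of results.
        kwargs = {"nextToken": token}


if __name__ == "__main__":
    import boto3  # third-party dependency; needs credentials for the account

    client = boto3.client("fis")
    # "MyExperimentTag" is a placeholder; use the tag set in your own step.
    for match in find_experiments_by_tag(client, "Name", "MyExperimentTag"):
        print(match["id"], match["status"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.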
For a comprehensive workflow in a single account please review Chaos experiments using AWS Step Functions and AWS Fault Injection Simulator