Sunday, April 7, 2019

Week of Chaos


A. Principles of Chaos
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems and builds confidence in the operational behavior of those systems.
B. What is a Chaos Week?
A Chaos Week is a dedicated team week focused on using chaos engineering to reveal weaknesses in our systems. We've all heard of hack days and hack weeks, where you focus on building new features. A Chaos Week instead focuses on building more resilient systems by breaking things on purpose.

C. References

D. Chaos Week Goals and Approach
1. Check for data integrity issues to make sure that our customer data is safe.
2. Testing reflects real-world risks and impact:
  • Separate attackers from the support team for a realistic simulation of MTTD/MTTR (real-world surprise).
  • Test scenarios reflect chronic real-world problems encountered in the Autodesk environment.
  • Testing and tools capture enough detail to measure MTTR and MTTD (if needed, engage a facilitator until sufficient automated measurement removes the need).
  • The system under test has load running against it (load or smoke tests).
  • The system under test has the same tools instrumented as production.
3. Testing identifies helpful and missing resources necessary for 99.9% MTTR/MTTD to satisfy the SLA (monitoring / metrics / logging / escalation / incident):
  • Support team shares the specific tools used (monitoring / logging / runbooks / ...) for each scenario and the efficacy of each tool, and identifies gaps (during the test or in the postmortem)
  • Support team identifies critical Single Points of Failure (SPOF) for scenarios (people, systems, ...)
  • Ensure scale capacity of 10x with no customer impact
4. Testing embraces org principles: customer focus, automation, quality, security, transparency, blameless postmortem, continuous improvement:
  • Automation: Leverage existing available frameworks for Chaos and Load testing
  • Transparency: No issues to be hidden
  • Continuous Improvement: gather and synthesize data to enable trending for future test runs
  • Customer Focus: Prioritize testing and remediation based on reduction of MTTD/MTTR.
  • Coordinate testing to mitigate the impact on other teams and ensure Chaos testing does not impact real customers
  • Security: Do not compromise on security.
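
Since several of the goals above depend on MTTD and MTTR, it helps to agree on how they are computed from a test run. Below is a minimal Python sketch. It assumes MTTD is measured from fault injection to detection and MTTR from fault injection to recovery (teams define these differently). The `Incident` record, its field names, and the sample timings are made up for illustration and do not come from any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One chaos experiment run (hypothetical record layout)."""
    scenario: str
    injected_at: datetime   # when the attacker injected the fault
    detected_at: datetime   # when the support team or an alert noticed it
    recovered_at: datetime  # when the service was healthy again


def mttd_seconds(incidents):
    """Mean time to detect: average of (detected - injected), in seconds."""
    return mean((i.detected_at - i.injected_at).total_seconds() for i in incidents)


def mttr_seconds(incidents):
    """Mean time to recover: average of (recovered - injected), in seconds."""
    return mean((i.recovered_at - i.injected_at).total_seconds() for i in incidents)


# Made-up sample runs, only to show the arithmetic.
start = datetime(2019, 4, 1, 10, 0, 0)
runs = [
    Incident("kill one instance", start,
             start + timedelta(seconds=30), start + timedelta(seconds=120)),
    Incident("Aurora writer failover", start,
             start + timedelta(seconds=90), start + timedelta(seconds=300)),
]

print(mttd_seconds(runs))  # 60.0
print(mttr_seconds(runs))  # 210.0
```

Keeping one record per run in this shape also makes it easy to trend the numbers across future Chaos Weeks.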

E. Some example test scenarios for Chaos Week
  1. RDS data is deleted / RDS has been deleted (full recovery from snapshot needed)
     Here is an example of running it using Terraform
  2. AZ is unavailable
  3. S3 file or folder has been deleted
  4. S3 bucket has been deleted
  5. Kill containers/instances
     1. Kill one service instance - watch for automatic restart - record results
     2. Kill instances in one AZ/region - watch for automatic restart - record results
     3. Kill all instances - watch for automatic restart - record results
  6. Stop instances/processes
     1. Stop process on one instance - watch for health check failure and self-resolution - record results
     2. Stop process on all instances - watch for health check failures and self-resolution - record results
  7. Database failover
     1. Failover Aurora writer - watch and record results
     2. Failover Aurora reader - watch and record results
     3. Failover RDS - watch and record results
  8. Redis/ElastiCache failover
     1. Kill one slave - watch and record results
     2. Kill all slaves - watch and record results
     3. Stop all slaves - watch and record results
     4. Kill master - watch and record results
  9. Simulate 429 error responses from an external dependency (error rate < 100%)
  10. Simulate timeouts / an inaccessible external dependency (error rate = 100%)
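
The two external-dependency scenarios can be rehearsed without touching the real service by wrapping the client call in a fault injector. Below is a minimal Python sketch, assuming the application reaches its dependency through a single function. `FaultInjector`, `DependencyTimeout`, `call_dependency`, and the response dictionary shape are hypothetical names for illustration only. The seed just makes a run repeatable.

```python
import random


class DependencyTimeout(Exception):
    """Raised to simulate a dependency that never answers."""


class FaultInjector:
    """Wraps a dependency call and injects faults at a given rate (hypothetical helper)."""

    def __init__(self, call, mode, error_rate, seed=None):
        if mode not in ("http_429", "timeout"):
            raise ValueError("mode must be 'http_429' or 'timeout'")
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be between 0.0 and 1.0")
        self.call = call
        self.mode = mode
        self.error_rate = error_rate
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
        # injects a fault and a rate of 0.0 never does.
        if self.rng.random() < self.error_rate:
            if self.mode == "timeout":
                raise DependencyTimeout("simulated timeout")
            return {"status": 429, "body": "Too Many Requests"}
        return self.call(*args, **kwargs)


def call_dependency(item_id):
    """Stand-in for the real external service call."""
    return {"status": 200, "body": f"item {item_id}"}


# Scenario: partial throttling (error rate < 100%).
throttled = FaultInjector(call_dependency, mode="http_429", error_rate=0.3, seed=42)
statuses = [throttled(i)["status"] for i in range(1000)]
print(statuses.count(429) / len(statuses))  # close to the configured 0.3

# Scenario: dependency fully unreachable (error rate = 100%).
unreachable = FaultInjector(call_dependency, mode="timeout", error_rate=1.0)
try:
    unreachable(1)
except DependencyTimeout as exc:
    print(exc)  # simulated timeout
```

With the wrapper in place, the interesting part of the test is how the caller reacts: whether it retries with backoff on 429s, and whether it fails fast or degrades gracefully when every call times out.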

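Most of the scenarios above share the same shape: inject a fault, watch the health check, and record the results. Below is a minimal Python sketch of that loop, assuming the fault and the health check can each be expressed as a plain function. `run_experiment`, `wait_until`, and the toy `FakeService` (with its made-up outage window) are hypothetical. In a real run, `inject_fault` would be whatever kills the instance or triggers the failover, and `is_healthy` would call the service's actual health endpoint.

```python
import time


def wait_until(predicate, deadline, poll_s, clock, sleep):
    """Poll predicate until it returns True or the deadline passes."""
    while clock() < deadline:
        if predicate():
            return clock()
        sleep(poll_s)
    return None


def run_experiment(name, inject_fault, is_healthy, timeout_s=300.0, poll_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, wait for the health check to fail, then wait for recovery."""
    started = clock()
    deadline = started + timeout_s
    inject_fault()
    failed_at = wait_until(lambda: not is_healthy(), deadline, poll_s, clock, sleep)
    recovered_at = None
    if failed_at is not None:
        recovered_at = wait_until(is_healthy, deadline, poll_s, clock, sleep)
    return {
        "scenario": name,
        "seconds_to_failure": None if failed_at is None else failed_at - started,
        "seconds_to_recovery": None if recovered_at is None else recovered_at - started,
    }


class FakeService:
    """Toy service with a fake clock: unhealthy from t=10s to t=40s once killed."""

    def __init__(self):
        self.now = 0.0
        self.killed = False

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds  # advance fake time instead of really sleeping

    def kill(self):
        self.killed = True

    def is_healthy(self):
        return not (self.killed and 10.0 <= self.now < 40.0)


svc = FakeService()
result = run_experiment("kill one service instance", svc.kill, svc.is_healthy,
                        clock=svc.clock, sleep=svc.sleep)
print(result)
# {'scenario': 'kill one service instance', 'seconds_to_failure': 10.0, 'seconds_to_recovery': 40.0}
```

If the health check never fails before the timeout, both timings come back as `None`. That is worth recording too, since it means either the fault had no visible impact or the health check cannot see it.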

You are more than welcome to suggest more test scenarios in the comments below.