Keep a Tip: April 2019

Sunday, April 7, 2019

Week of Chaos

A. Principles of Chaos

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems and builds confidence in the operational behavior of those systems.

B. What is a Chaos Week?

A Chaos Week is a dedicated team week focused on using chaos engineering to reveal weaknesses in our systems. We’ve all heard of hack days and hack weeks, where you focus on building new features. A Chaos Week instead focuses on building more resilient systems by breaking things on purpose.

C. References

https://www.slideshare.net/hornsby/chaos-engineering-why-breaking-things-should-be-practised-93761039

D. Chaos Week Goals and Approach

1. Check data integrity issues to make sure that our customer data is safe.

2. Testing reflects real-world risks and impact:

Separate the attackers from the support team for a realistic simulation of MTTD/MTTR (real-world surprise).

Test scenarios reflect chronic real-world problems encountered in the Autodesk environment.

Testing and tools capture enough detail to measure MTTR and MTTD (if needed, engage a facilitator until sufficient automated measurement removes the need).

The system under test has load running against it (load or smoke tests).

The system under test has the same tools instrumented as production.
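To make "enough detail to measure MTTR and MTTD" concrete, here is a minimal sketch of what a facilitator could record per experiment and how the two averages fall out of it. The `Experiment` record, its field names, and the choice to measure MTTR from fault injection to recovery are assumptions made for illustration; teams that measure MTTR from detection would subtract `detected_at` instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Experiment:
    """One chaos experiment, as noted down by the facilitator (hypothetical schema)."""
    name: str
    injected_at: datetime   # when the attackers introduced the fault
    detected_at: datetime   # when the support team noticed (alert or human)
    recovered_at: datetime  # when the service was healthy again


def mttd(experiments):
    """Mean time to detect: average of (detected_at - injected_at)."""
    seconds = [(e.detected_at - e.injected_at).total_seconds() for e in experiments]
    return timedelta(seconds=mean(seconds))


def mttr(experiments):
    """Mean time to recover, measured here from injection (an assumption)."""
    seconds = [(e.recovered_at - e.injected_at).total_seconds() for e in experiments]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    start = datetime(2019, 4, 7, 9, 0)
    runs = [
        Experiment("kill-one-instance", start,
                   start + timedelta(minutes=2), start + timedelta(minutes=10)),
        Experiment("aurora-writer-failover", start,
                   start + timedelta(minutes=4), start + timedelta(minutes=20)),
    ]
    print("MTTD:", mttd(runs))
    print("MTTR:", mttr(runs))
```

Keeping the three timestamps per run, rather than only the averages, also gives the trending data that the continuous-improvement goal asks for.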

3. Testing identifies helpful and missing resources necessary for 99.9% MTTR/MTTD to satisfy the SLA (monitoring / metrics / logging / escalation / incident):

Support team shares the specific tools used for each scenario (monitoring / logging / runbooks / ...), reports how effective each tool was, and identifies gaps (during the test or in the postmortem).

Support team identifies critical Single Points of Failure (SPOF) for each scenario (people, systems, ...).

Ensure scale capacity of 10x with no customer impact

4. Testing embraces org principles: customer focus, automation, quality, security, transparency, blameless postmortem, continuous improvement:

Automation: Leverage existing available frameworks for Chaos and Load testing

Transparency: No issues to be hidden

Continuous Improvement: gather and synthesize data to enable trending for future test runs

Customer Focus: Prioritize testing and remediation based on reduction of MTTD/MTTR.

Coordinate testing to mitigate impact to other teams and ensure Chaos testing will not impact real customers

Security: Do not compromise on security.

E. Example test scenarios for Chaos Week

RDS data is deleted / RDS instance has been deleted (full recovery from snapshot needed)

Here is an example of running it using Terraform

AZ is unavailable

S3 File or folder has been deleted

S3 bucket has been deleted

Kill containers/instances

1.    Kill one service instance - watch for automatic restart - record results

2.    Kill instances in one AZ/region - watch for automatic restart - record results

3.    Kill all instances - watch for automatic restart - record results
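The three kill scenarios above differ only in which instances get targeted, so target selection can be sketched separately from the actual termination call. The following is a minimal sketch, not a real tool: the `Instance` record, the scenario names, and the instance IDs are all hypothetical, and the function only decides which instances to kill. Terminating them would be done with whatever cloud API or chaos framework the team already uses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Minimal inventory record (hypothetical; a real one would come from the cloud API)."""
    instance_id: str
    az: str


def pick_targets(instances, scenario, rng=None):
    """Return the instances to kill for a scenario: 'one', 'one_az', or 'all'."""
    if rng is None:
        rng = random.Random()
    if not instances:
        return []
    if scenario == "one":
        # Scenario 1: a single randomly chosen instance.
        return [rng.choice(instances)]
    if scenario == "one_az":
        # Scenario 2: every instance in one randomly chosen availability zone.
        az = rng.choice(sorted({i.az for i in instances}))
        return [i for i in instances if i.az == az]
    if scenario == "all":
        # Scenario 3: the whole fleet.
        return list(instances)
    raise ValueError(f"unknown scenario: {scenario!r}")


if __name__ == "__main__":
    fleet = [
        Instance("i-a1", "us-east-1a"),
        Instance("i-a2", "us-east-1a"),
        Instance("i-b1", "us-east-1b"),
    ]
    for scenario in ("one", "one_az", "all"):
        print(scenario, [i.instance_id for i in pick_targets(fleet, scenario)])
```

Passing a seeded `random.Random` makes a run repeatable, which helps when the same scenario has to be replayed after a fix.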

Stop instances/processes

1.    Stop process on one instance - watch for health check failure and self-resolution - record results

2.    Stop process on all instances - watch for health check failures and self-resolution - record results
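"Watch for health check failure and self-resolution" can be turned into a number by polling a health probe and timing how long the service takes to come back. A minimal sketch follows; `watch_recovery` and its parameters are hypothetical names, and `probe` stands for any callable that returns True when the service is healthy (for example, a wrapper around an HTTP health endpoint).

```python
import time


def watch_recovery(probe, timeout_s=300.0, interval_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it fails and then succeeds again.

    Returns the seconds elapsed from the start of watching until the first
    healthy result that follows a failure, or None if that never happens
    within `timeout_s` (no failure observed, or no self-resolution).
    """
    start = clock()
    seen_failure = False
    while clock() - start < timeout_s:
        if not probe():
            seen_failure = True          # the health check noticed the fault
        elif seen_failure:
            return clock() - start       # healthy again after a failure
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # Simulated probe: healthy, then two failed checks, then healthy again.
    results = iter([True, False, False, True])
    elapsed = watch_recovery(lambda: next(results), timeout_s=10, interval_s=0.1)
    print("recovered after", elapsed, "seconds")
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; in a real run the defaults are enough. A result of None is itself worth recording, since it means either the health check never fired or the service did not heal on its own.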

Database failover

1.    Failover Aurora writer - watch and record results

2.    Failover Aurora reader - watch and record results

3.    Failover RDS - watch and record results

Redis/Elasticache failover

1.    Kill one slave - watch and record results

2.    Kill all slaves - watch and record results

3.    Stop all slaves - watch and record results

4.    Kill master - watch and record results

Simulate 429 error responses from an external service we depend on (error rate < 100%)

Simulate timeouts or an inaccessible external service we depend on (error rate = 100%)
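Both dependency scenarios above can be approximated in a test environment by wrapping the client call and failing a configurable fraction of requests. This is a minimal sketch under stated assumptions: `with_faults` and `RateLimited` are hypothetical names, `RateLimited` merely stands in for however the real client surfaces an HTTP 429, and the built-in `TimeoutError` stands in for a timed-out or unreachable dependency.

```python
import random


class RateLimited(Exception):
    """Stands in for an HTTP 429 (Too Many Requests) from the dependency."""


def with_faults(call, error_rate, fault, rng=None):
    """Wrap `call` so that roughly `error_rate` of invocations raise `fault()`.

    error_rate is a fraction in [0.0, 1.0]; 1.0 fails every call.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 1.0 always fails and 0.0 never does.
        if rng.random() < error_rate:
            raise fault()
        return call(*args, **kwargs)

    return wrapped


def fetch_profile(user_id):
    """Placeholder for a real call to the external service (hypothetical)."""
    return {"id": user_id}


if __name__ == "__main__":
    # Partial failure: about 30% of calls are rejected as rate limited.
    flaky = with_faults(fetch_profile, 0.3, RateLimited)
    # Total failure: every call times out.
    down = with_faults(fetch_profile, 1.0, TimeoutError)
    for name, fn in (("flaky", flaky), ("down", down)):
        try:
            print(name, fn(42))
        except (RateLimited, TimeoutError) as exc:
            print(name, "failed with", type(exc).__name__)
```

Injecting at the client wrapper keeps the experiment inside the service under test, which fits the goal of not impacting other teams or real customers.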

You are more than welcome to suggest more test scenarios in the comments below.
