Adaptive Ephemeral Chaos Infrastructure for Resilience Validation in AWS Multi-Account Environments
DOI:
https://doi.org/10.63282/3050-9246.IJETCSIT-V7I1P148
Keywords:
Chaos Engineering, AWS Fault Injection Simulator, Cloud Resilience, Distributed Systems, Multi-Account Architecture, CI/CD Orchestration, Convergent Restoration, Ephemeral Infrastructure
Abstract
Running chaos engineering experiments safely across multiple AWS accounts is difficult in practice. Most existing tools require manual setup per account, use fixed experiment templates that do not adapt to the application, and offer no reliable way to confirm that the system actually recovered after a test. This paper presents Adaptive Ephemeral Chaos Infrastructure (ECI), a framework that addresses these problems through four specified capabilities: a seven-phase experiment lifecycle that enforces safety at every step, a convergent recovery verification protocol that confirms system stability before an experiment is declared complete, a CI/CD integration with distributed collision detection, and a multi-account coordination protocol that synchronizes fault injection across AWS accounts. ECI also proposes two concepts reserved for future implementation: an Application Dependency Graph for automatic topology mapping and a centrality-based adaptive experiment selection model. Both are described here at a design level to establish the intended direction of the framework. The paper concludes with a proposed evaluation methodology that outlines how the four implemented capabilities would be measured in a structured test environment.
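To make the idea of convergent recovery verification concrete, the following is a minimal sketch, not the protocol specified in the paper. It assumes one plausible reading: an experiment counts as recovered only after several consecutive metric samples fall back within a tolerance band around the pre-fault baseline. The function name `has_converged` and the parameters `tolerance` and `required_consecutive` are hypothetical and chosen for illustration only.

```python
from typing import Iterable


def has_converged(
    samples: Iterable[float],
    baseline: float,
    tolerance: float = 0.05,
    required_consecutive: int = 3,
) -> bool:
    """Return True once enough consecutive samples sit near the baseline.

    Assumption (not from the paper): "near" means within a relative
    `tolerance` of `baseline`, and any out-of-band sample resets the streak.
    """
    streak = 0
    for value in samples:
        if abs(value - baseline) <= tolerance * abs(baseline):
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # A single regression means stability has not been demonstrated yet.
            streak = 0
    return False


# Hypothetical post-fault latency samples in milliseconds, baseline 100 ms.
post_fault_latency_ms = [250.0, 180.0, 104.0, 101.0, 99.0, 100.0]
recovered = has_converged(post_fault_latency_ms, baseline=100.0)
```

Requiring a streak rather than a single in-band sample is the design choice that makes the check "convergent": a system that briefly touches its baseline and then degrades again is not reported as recovered.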
