Integrating Site Reliability Engineering (SRE) Principles into Enterprise Architecture for Predictive Resilience
DOI:
https://doi.org/10.63282/3050-9246.IJETCSIT-V4I3P117
Keywords:
Site Reliability Engineering, Enterprise Architecture, Observability, Predictive Resilience, AIOps, Service Level Objectives
Abstract
Modern enterprises increasingly depend on complex distributed software systems in which small faults can cascade into large customer impact. Site Reliability Engineering (SRE) provides a disciplined approach to reliability through explicit service level objectives (SLOs), error budgets, automation, incident response, and continuous learning. Enterprise Architecture (EA) provides an enterprise-wide design view that connects business capabilities, information flows, and technology platforms. This paper proposes an integrated framework that makes reliability a first-class architecture concern and links architecture decisions to runtime evidence. The framework introduces an artifact mapping between Enterprise Architecture models and SRE primitives such as service level indicators, service level objectives, and error budgets. It also defines a predictive resilience loop that combines observability telemetry with architecture context and change signals to anticipate degradation risk before user impact occurs. The paper synthesizes related work on resilience, observability, trace and log anomaly detection, and interpretable root cause analysis, then proposes implementation patterns for SLO hierarchies, drift detection, policy-driven release governance, and chaos experiments. Finally, it defines evaluation metrics and presents an illustrative enterprise scenario that demonstrates how predictive signals can trigger targeted governance actions and architecture updates.
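To make the SRE primitives named in the abstract concrete, the following is a minimal sketch of how an availability SLO, its error budget, and a burn rate relate arithmetically, and how a release gate might consume them. It is not the paper's implementation: the class `Slo`, the functions `error_budget`, `budget_remaining`, `burn_rate`, and `release_allowed`, the service name, and the 10% gate threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO for one service (illustrative only)."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget(slo: Slo, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


def burn_rate(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    values above 1.0 spend it faster.
    """
    allowed = 1.0 - slo.target
    if total_requests == 0:
        return 0.0
    if allowed == 0:
        return 0.0 if failed_requests == 0 else float("inf")
    return (failed_requests / total_requests) / allowed


def release_allowed(remaining_fraction: float, min_remaining: float = 0.10) -> bool:
    """Assumed governance rule: block releases when little budget is left."""
    return remaining_fraction >= min_remaining


if __name__ == "__main__":
    slo = Slo(service="checkout", target=0.999)
    remaining = budget_remaining(slo, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {remaining:.2%}")
    print(f"release allowed: {release_allowed(remaining)}")
```

In this sketch the release gate reads only the remaining budget; a predictive variant along the lines the abstract describes would presumably also weigh burn rate trends and change signals, which is left out here.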
