Failure without Faults: How Enterprise Storage Systems Degrade Long Before They Break

Authors

  • Mallikarjun Vppalapati Sr Storage Engineer at Vsion Technologies, USA. Author
  • Phani Kumar Talasila Storage engineer III at romedica health systems, USA. Author

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V2I1P113

Keywords:

Enterprise Storage Systems, Performance Degradation, Silent Failures, Storage Observability, Predictive Maintenance, Reliability Engineering, Capacity Planning, Failure Modeling

Abstract

Enterprise storage systems are the backbone of today’s digital infrastructure, yet their reliability is still measured predominantly by traditional failure models that focus on discrete events such as a disk crash, a controller failure, or a period of data unavailability. These models have been instrumental in identifying and isolating catastrophic failures, but they are not designed to detect a class of operational issues that arises in large-scale, software-defined, and cloud-integrated storage platforms: systems that traditional metrics report as “healthy” while they quietly lose performance, efficiency, or resilience. This paper introduces the concept of failure without faults in enterprise storage systems, in which a system deteriorates gradually through performance decay, resource contention, firmware or software drift, and latent configuration risks long before any component breaks or any alert is triggered. It argues that the industry’s heavy reliance on binary health indicators and threshold-based alerts has created a blind spot that conceals early warning signals and delays remediation. To address this, we propose an observability-driven approach that links low-level telemetry (I/O tail latency, cache efficiency loss, queue depth variability, and rebuild amplification) with higher-level degradation markers and predictive risk signals. The approach is validated through a case study of a large real-world storage deployment in which no component was reported as failed, yet application slowdowns grew longer and operational instability worsened over time. The key finding is that degradation was observable well before users experienced problems, and that its patterns were visible only in the history of system behavior, not in a one-off health check.
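
The contrast the abstract draws between threshold-based alerts and history-based observation can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the 50 ms alert limit, the 2% drift limit, and the latency values are all hypothetical, chosen only to show how a series of p99 latency readings can stay below a static threshold while its trend already signals degradation.

```python
from statistics import mean

ALERT_LIMIT_MS = 50.0   # hypothetical static alert threshold
DRIFT_LIMIT = 0.02      # hypothetical: flag growth above 2% of baseline per interval


def threshold_alert(p99_ms: float) -> bool:
    """Binary health check: fires only when one reading exceeds the limit."""
    return p99_ms > ALERT_LIMIT_MS


def slope_per_interval(history: list[float]) -> float:
    """Least-squares slope of the readings (ms per sampling interval)."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two readings to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(history)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(history))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def degradation_signal(history: list[float]) -> bool:
    """History-based check: fires when latency grows steadily relative
    to the first reading, even if every reading is below the alert limit."""
    if len(history) < 2:
        return False
    baseline = history[0]
    return slope_per_interval(history) / baseline > DRIFT_LIMIT


# Illustrative daily p99 latency readings in ms (made up, not from the paper).
p99_history_ms = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

alerts_fired = [threshold_alert(v) for v in p99_history_ms]
print("any threshold alert:", any(alerts_fired))
print("degradation signal:", degradation_signal(p99_history_ms))
```

In this sketch every individual reading passes the static check, so a one-off health check would report the system as healthy, while the trend over the same readings flags sustained growth. A real deployment would need to account for workload seasonality and noise, which this example ignores.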

Published

2021-03-30

Section

Articles

How to Cite

Vppalapati M, Talasila PK, Vppalapati M. Failure without Faults: How Enterprise Storage Systems Degrade Long Before They Break. IJETCSIT [Internet]. 2021 Mar. 30 [cited 2026 May 31];2(1):115-23. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/728
