Dual-Model Machine Learning for Predictive Leak Forecasting and Detection in Liquid-Cooled AI Data Centers

Authors

  • Krishna Chaitanya Sunkara, AI Data Center Engineering and Cloud Infrastructure, Oracle, Raleigh, USA
  • Rambabu Konakanchi, Cloud Infrastructure Engineering, Charles Schwab, Austin, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V6I4P128

Keywords:

Liquid Cooling, Leak Detection, LSTM, Random Forest, Energy Efficiency, Smart IoT, Green Data Centers, AI Data Centers, GB200, NVIDIA, GPU, Data Centers

Abstract

Modern GPU data centers supporting AI training workloads have increasingly adopted direct-to-chip liquid cooling systems to manage thermal loads exceeding 50 kW per rack, far beyond air cooling capabilities. However, coolant leaks in these high-density facilities result in substantial energy waste through unplanned shutdowns, extended repair periods, and preventive isolation of adjacent racks. We present a novel smart IoT monitoring system combining LSTM neural networks for probabilistic time-to-leak forecasting with Random Forest classifiers for real-time binary detection. The dual-model architecture provides both advance warning (2-4 hours) for planned maintenance and immediate alerts (sub-minute latency) for sudden failures. Validation using simulation-based data generation following ASHRAE 2021 specifications demonstrates strong performance: 96.5% F1-score for binary detection and 87% forecasting accuracy at 90% probability within ±30-minute windows. The dataset comprises 72 hours of minute-resolution monitoring with realistic leak scenarios incorporating documented industry patterns. Statistical analysis reveals strong predictive signals from humidity (r = 0.70, p < 0.001), pressure (r = -0.50), and flow rate, while temperature shows minimal immediate response (p = 0.236) due to thermal inertia, guiding optimal sensor deployment. The integrated system achieves 98.4% coverage with 850 ms end-to-end latency. Energy analysis shows this approach could prevent approximately 1,500 kWh of annual energy waste for a 47-rack facility, supporting sustainable operations. The complete implementation is provided to facilitate validation in operational environments, establishing a foundation for intelligent leak management as liquid cooling becomes standard in AI infrastructure.
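
As an illustration of how the outputs of the two models might be combined into alert levels, the sketch below gives detection priority over forecasting. It is not the authors' implementation: the names `ModelOutputs`, `Alert`, and `fuse`, and all threshold values, are assumptions chosen only to mirror the 2-4 hour warning window and 90% probability level mentioned in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Alert(Enum):
    NONE = "none"
    PLAN_MAINTENANCE = "plan_maintenance"  # forecaster: leak likely within hours
    IMMEDIATE = "immediate"                # detector: leak appears to be present now


@dataclass
class ModelOutputs:
    detect_prob: float       # detector's probability that a leak is present now
    forecast_prob: float     # forecaster's probability of a leak within the horizon
    forecast_minutes: float  # forecaster's estimated time to leak, in minutes


# Hypothetical thresholds, for illustration only (not taken from the paper).
DETECT_THRESHOLD = 0.5
FORECAST_THRESHOLD = 0.9
HORIZON_MINUTES = 240  # upper end of the 2-4 hour advance-warning window


def fuse(outputs: ModelOutputs) -> Alert:
    """Map the two model outputs to one alert level; detection takes priority."""
    if outputs.detect_prob >= DETECT_THRESHOLD:
        return Alert.IMMEDIATE
    if (outputs.forecast_prob >= FORECAST_THRESHOLD
            and outputs.forecast_minutes <= HORIZON_MINUTES):
        return Alert.PLAN_MAINTENANCE
    return Alert.NONE
```

Under these assumed thresholds, a confident forecast of a leak in about three hours would yield a maintenance alert, while a high detector probability would yield an immediate alert regardless of the forecast.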


Published

2025-12-27

Section

Articles

How to Cite

1. Sunkara KC, Konakanchi R. Dual-Model Machine Learning for Predictive Leak Forecasting and Detection in Liquid-Cooled AI Data Centers. IJETCSIT [Internet]. 2025 Dec. 27 [cited 2026 Jan. 28];6(4):176-82. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/534
