Designing High-Throughput Data Pipelines: A Performance-Centric Architectural Framework for Low-Latency Analytics in Distributed Cloud Environments

Authors

  • Hitesh Jodhavat, Performance Architect at Oracle, Columbus, Ohio, United States
  • Supreeth Meka, Sales Planning and Strategy Consultant, Dell Technologies, USA
  • Beverly DSouza, Data Engineer, Patreon

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V6I2P107

Keywords:

High-throughput, Data pipeline, Cloud computing, Low-latency, Stream processing, Microservices, Kubernetes, Apache Kafka

Abstract

Real-time analytics over large datasets is now central to the design of intelligent systems and decision-making methods. This work presents a framework for efficient analytical processing that focuses on minimizing latency in distributed cloud environments. Because the volume of data produced by IoT devices, social media, transactional systems, and sensor networks keeps rising, a capable and scalable system is needed to process and analyze this data quickly. This research finds that existing data pipeline systems struggle with data ingestion delays, limited use of different processors, slow database connections, and inefficient resource utilization. To address these problems, we present a new architectural model that uses micro-batching, asynchronous processing, edge computing, and intelligent load distribution. The architecture is organized into five layers, including data ingestion, stream event processing, storage, and real-time analytics. Containers deployed with Kubernetes provide scalability and fault tolerance. The traditional and proposed architectures are compared on both real and simulated data using AWS, Azure, and GCP cloud services. Performance is assessed in terms of throughput, response time, resource utilization, and cost. The experiments show that the framework can reduce total latency by 45% and increase data throughput by 60% relative to typical systems. The paper includes a review of current literature, a structured system design, implementation recommendations, a performance evaluation, and directions for future research. By integrating Apache Kafka, Apache Flink, and TensorFlow Extended, the proposed framework allows businesses to build fast and agile data analytics platforms in the cloud.
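
The abstract names micro-batching and asynchronous processing but gives no implementation details. The following is a minimal sketch of how the two ideas can combine, not the authors' implementation. It assumes an in-memory asyncio.Queue as a stand-in for a real ingestion source such as a Kafka topic, and the names micro_batches, run_demo, max_size, and max_wait_s are hypothetical. A batch is flushed when it reaches max_size items or when max_wait_s seconds have passed since its first item arrived, whichever comes first.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def micro_batches(
    queue: "asyncio.Queue[Any]",
    max_size: int,
    max_wait_s: float,
    stop: object,
) -> AsyncIterator[List[Any]]:
    """Yield batches of up to max_size items, flushing early after max_wait_s.

    `stop` is a sentinel object (an assumption of this sketch) that the
    producer enqueues to signal that no more events will arrive.
    """
    loop = asyncio.get_running_loop()
    done = False
    while not done:
        first = await queue.get()  # wait for the first item of a new batch
        if first is stop:
            return
        batch = [first]
        deadline = loop.time() + max_wait_s
        while len(batch) < max_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # time budget for this batch is spent
            try:
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break  # flush a partial batch rather than keep waiting
            if item is stop:
                done = True
                break
            batch.append(item)
        yield batch


async def run_demo(n_events: int = 10, max_size: int = 4) -> List[List[int]]:
    """Pre-fill a queue with integer events and collect the resulting batches."""
    queue: asyncio.Queue = asyncio.Queue()
    stop = object()
    for i in range(n_events):
        queue.put_nowait(i)
    queue.put_nowait(stop)
    return [b async for b in micro_batches(queue, max_size, 0.5, stop)]


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The size limit bounds per-batch processing cost, while the time limit bounds how long an event can wait before being forwarded. Tuning the two against each other is the usual throughput versus latency trade-off in micro-batched pipelines.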

Published

2025-04-28

Section

Articles

How to Cite

1. Jodhavat H, Meka S, DSouza B. Designing High-Throughput Data Pipelines: A Performance-Centric Architectural Framework for Low-Latency Analytics in Distributed Cloud Environments. IJETCSIT [Internet]. 2025 Apr. 28 [cited 2025 Jul. 14];6(2):56-62. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/221
