Designing High-Throughput Data Pipelines: A Performance-Centric Architectural Framework for Low-Latency Analytics in Distributed Cloud Environments

Hitesh Jodhavat; Supreeth Meka; Beverly DSouza

doi:10.63282/3050-9246.IJETCSIT-V6I2P107

Authors

Hitesh Jodhavat, Performance Architect at Oracle, Columbus, Ohio, United States.
Supreeth Meka, Sales Planning and Strategy Consultant, Dell Technologies, USA.
Beverly DSouza, Data Engineer, Patreon.

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V6I2P107

Keywords:

High-throughput, Data pipeline, Cloud computing, Low-latency, Stream processing, Microservices, Kubernetes, Apache Kafka

Abstract

Real-time analytics over large datasets is now central to the design of intelligent systems and decision-making methods. This work presents a framework for efficient analytical processing that focuses on minimizing latency in distributed cloud environments. Because the volume of data produced by IoT devices, social media, transactional systems, and sensor networks keeps rising, a capable and scalable system is needed to ingest and analyze this data quickly. According to this research, existing data pipeline systems suffer from data ingestion delays, limited use of different processors, slow database connections, and inefficient resource utilization. To address these problems, we present a new architectural model that uses micro-batching, asynchronous processing, edge computing, and intelligent load distribution. The methodology is organized into five layers, beginning with data intake and including streaming event processing, storage, and real-time data analytics. Scalability and fault tolerance are achieved by deploying containers with Kubernetes. The traditional and proposed architectures are compared on both real and simulated data using AWS, Azure, and GCP cloud services. Performance is assessed in terms of processing speed, response time, resource utilization, and cost. The experiments show that the framework can reduce total latency by 45% and increase data throughput by 60% relative to typical systems. This paper includes a review of the current literature, a structured system design, implementation recommendations, a performance evaluation, and directions for future research. By integrating Apache Kafka, Apache Flink, and TensorFlow Extended, the proposed framework allows businesses to build fast and agile data analytics platforms in the cloud.
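The abstract names micro-batching and asynchronous processing as building blocks but does not describe how they are implemented. The following is a minimal sketch of one common way to combine the two, assuming an in-process `asyncio.Queue` as a stand-in for the ingestion layer. The function name `micro_batches`, the parameters `max_size` and `max_wait_s`, and the `None` end-of-stream sentinel are illustrative assumptions and are not taken from the paper.

```python
import asyncio
from typing import AsyncIterator, List


async def micro_batches(
    queue: asyncio.Queue, max_size: int, max_wait_s: float
) -> AsyncIterator[List[object]]:
    """Yield batches flushed when they hold max_size items or when
    max_wait_s seconds have passed since the batch was opened."""
    loop = asyncio.get_running_loop()
    while True:
        first = await queue.get()
        if first is None:  # hypothetical end-of-stream sentinel
            return
        batch = [first]
        deadline = loop.time() + max_wait_s
        while len(batch) < max_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # time budget used up: flush a partial batch
            try:
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break  # no event arrived in time: flush a partial batch
            if item is None:
                yield batch  # flush what we have, then stop
                return
            batch.append(item)
        yield batch


async def run_demo() -> List[List[object]]:
    # Seven pre-loaded events stand in for a live event stream.
    queue: asyncio.Queue = asyncio.Queue()
    for event in range(7):
        queue.put_nowait(event)
    queue.put_nowait(None)
    return [b async for b in micro_batches(queue, max_size=3, max_wait_s=0.05)]


batches = asyncio.run(run_demo())
print(batches)
```

The size limit bounds per-batch work, while the time limit bounds how long an event can wait before it is processed, which is the usual trade-off between throughput and latency in a micro-batched pipeline.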


