Serverless Architectures for Scalable Data Analytics Workflows in Cloud BI Systems

Milan Gupta

doi:10.63282/3050-9246.IJETCSIT-V6I4P134

Authors

Milan Gupta, Independent Researcher, USA.

Keywords:

Serverless Computing, Cloud BI Systems, Scalable Data Analytics, Event-Driven Architectures, Distributed Data Processing, Workflow Orchestration, Elastic Resource Management, Function-as-a-Service (FaaS), Multi-Tenant Analytics Platforms, Observability in Serverless Systems, Cost-Aware Execution, Cloud-Native Data Pipelines, Fault Tolerance and Reliability, Performance Optimization, Metadata-Driven Analytics

Abstract

Cloud Business Intelligence (BI) systems must increasingly process ever-growing datasets with agility and minimal operational overhead. Serverless computing, epitomized by services such as AWS Lambda, has emerged as a promising architecture for scalable data analytics workflows. This paper explores how serverless architectures can be used to build scalable, cost-efficient, and agile data analytics pipelines on the AWS cloud, integrating theoretical underpinnings with industry case studies and performance comparisons. We discuss the paradigm shift from traditional cluster-based ETL and data warehousing to Function-as-a-Service (FaaS) and fully managed services, highlighting benefits such as automatic scaling, fine-grained billing, and reduced DevOps burden. We also address challenges including cold-start latency, statelessness, data shuffling overhead, and orchestration complexity, along with recent research advances aimed at mitigating these issues. We present a reference serverless BI architecture on AWS (using services such as AWS Lambda, Glue, S3, Athena, and QuickSight) and an event-driven analytics pipeline design, complete with system diagrams. Through analysis of academic studies and practical benchmarks, we show that serverless analytics workflows can achieve high scalability and throughput (e.g., processing hundreds of millions of records in minutes) with significantly lower operational complexity and cost, often 4–10× cheaper than equivalent always-on clusters. We also review performance evaluations demonstrating that, with careful design, serverless pipelines can handle large payloads (100+ MB events) with consistent execution times.
In conclusion, serverless architectures represent a compelling next step for cloud BI, offering a path to self-service, on-demand analytics at scale, while ongoing innovations (e.g., optimized data orchestration and hybrid runtime models) continue to expand their applicability to larger, stateful, and latency-sensitive workloads.
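As an illustration of the event-driven pipeline design the abstract mentions, the following is a minimal sketch, not the paper's implementation. It assumes a Lambda-style function triggered by an S3 event notification whose objects hold JSON Lines records. The names `s3_objects`, `make_handler`, and `read_lines`, and the `region` and `amount` fields, are hypothetical and chosen only for this example. The object reader is injected so the sketch runs without any AWS SDK.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import unquote_plus


def s3_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def make_handler(read_lines: Callable[[str, str], Iterable[str]]):
    """Build a Lambda-style handler.

    `read_lines(bucket, key)` is a hypothetical reader supplied by the caller;
    in a real deployment it would stream the object's lines from storage.
    """
    def handler(event: dict, context=None) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        for bucket, key in s3_objects(event):
            for line in read_lines(bucket, key):
                row = json.loads(line)
                # Assumed record shape: {"region": ..., "amount": ...}
                totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
        return totals

    return handler
```

Because the handler keeps no state between invocations, each new object can be processed by an independent function instance, which is the property that lets such a pipeline scale out with the number of incoming events.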
