Serverless Architectures for Scalable Data Analytics Workflows in Cloud BI Systems
DOI: https://doi.org/10.63282/3050-9246.IJETCSIT-V6I4P134
Keywords:
Serverless Computing, Cloud BI Systems, Scalable Data Analytics, Event-Driven Architectures, Distributed Data Processing, Workflow Orchestration, Elastic Resource Management, Function-as-a-Service (FaaS), Multi-Tenant Analytics Platforms, Observability in Serverless Systems, Cost-Aware Execution, Cloud-Native Data Pipelines, Fault Tolerance and Reliability, Performance Optimization, Metadata-Driven Analytics
Abstract
Cloud Business Intelligence (BI) systems must process ever-growing datasets with agility and minimal operational overhead. Serverless computing, exemplified by services such as AWS Lambda, has emerged as a promising architecture for scalable data analytics workflows. This paper explores how serverless architectures can be used to build scalable, cost-efficient, and agile data analytics pipelines on AWS, combining theoretical underpinnings with industry case studies and performance comparisons. We discuss the shift from traditional cluster-based ETL and data warehousing to Function-as-a-Service (FaaS) and fully managed services, highlighting benefits such as automatic scaling, fine-grained billing, and reduced DevOps burden. We also address challenges including cold-start latency, statelessness, data shuffling overhead, and orchestration complexity, along with recent research advances aimed at mitigating them. We present a reference serverless BI architecture on AWS (using AWS Lambda, Glue, S3, Athena, and QuickSight) and an event-driven analytics pipeline design, complete with system diagrams. Drawing on academic studies and practical benchmarks, we show that serverless analytics workflows can achieve high scalability and throughput (e.g., processing hundreds of millions of records in minutes) with significantly lower operational complexity and cost, often 4–10× cheaper than equivalent always-on clusters. We also review performance evaluations demonstrating that, with careful design, serverless pipelines can handle large payloads (events of 100 MB or more) with consistent execution times.
In conclusion, serverless architectures represent a compelling next step for cloud BI, offering a path to self-service, on-demand analytics at scale, while ongoing innovations (e.g., optimized data orchestration and hybrid runtime models) continue to expand their applicability for larger, stateful, and latency-sensitive workloads.
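As an illustration of the event-driven pipeline design described in the abstract, the following is a minimal sketch of a function that reacts to newly landed S3 objects and plans one transform task per object. It is not the paper's implementation: the names `extract_s3_objects` and `handler`, the shape of the returned task dictionaries, and the example bucket and key are all hypothetical. The only external detail relied on is the documented S3 event notification layout (`Records`, then `s3.bucket.name` and `s3.object.key`, with the key URL-encoded).

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs found in an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # skip records that are not S3 notifications
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (a space becomes '+').
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda-style entry point: plan one transform task per new object.

    The task format is an assumption made for this sketch; a real pipeline
    would hand these tasks to whatever transform or orchestration step it uses.
    """
    tasks = [
        {"source": f"s3://{bucket}/{key}", "action": "transform"}
        for bucket, key in extract_s3_objects(event)
    ]
    return {"task_count": len(tasks), "tasks": tasks}


if __name__ == "__main__":
    # Hypothetical sample event with a single uploaded object.
    sample_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "raw-zone"},
                    "object": {"key": "sales/2024+report.csv"},
                }
            }
        ]
    }
    print(handler(sample_event, None))
```

Because the handler keeps no state between invocations, each upload can be processed by an independent function instance, which is the property that lets such a pipeline scale with the event rate.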
