A Survey of Petabyte-Scale Data Architectures for Self-Serve Generative AI: From Foundational Systems to Intelligent Abstraction and In-Database Optimization

Authors

  • Pinaki Bose, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V6I3P107

Keywords:

Generative AI, Data Architecture, Data Lakehouse, Semantic Layer, Knowledge Graph, Petabyte-Scale, Self-Service Analytics, Large Language Models (LLMs)

Abstract

The integration of generative AI (GenAI) into self-service analytics platforms, such as Microsoft Power BI Copilot and Amazon Q, has necessitated a paradigm shift in data architecture. While these tools aim to democratize data access through natural language interfaces, their efficacy depends on an underlying data foundation that can handle petabyte-scale data. This paper presents a survey and comparative analysis of three advanced architectural strategies designed to address the query latency, cost, and semantic ambiguity inherent in datasets of that size. The strategies examined are: (1) the Optimized Data Lakehouse, which focuses on foundational performance and cost-efficiency through open-source formats and high-performance query engines; (2) the Enterprise Semantic Layer and Knowledge Graph, an abstraction-first approach that ensures data consistency and mitigates AI hallucination by providing structured context; and (3) the Instance-Optimized LLM (IOLM-DB), a specialized, in-database method that overcomes the high cost of per-row AI inference. A detailed analysis reveals that no single solution is universally optimal. Instead, the most effective approach for many organizations is a phased, hybrid architecture that combines the strengths of the first two strategies and reserves the third for specific, high-value applications. This framework provides a roadmap for building an AI-native data platform that is both performant at scale and semantically intelligent.



Published

2025-08-09

Section

Articles

How to Cite

1. Bose P. A Survey of Petabyte-Scale Data Architectures for Self-Serve Generative AI: From Foundational Systems to Intelligent Abstraction and In-Database Optimization. IJETCSIT [Internet]. 2025 Aug. 9 [cited 2025 Sep. 18];6(3):43-7. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/361
