Automated Data Lake Onboarding: A Hybrid Metadata and Content-Based Approach for Schema Matching

Sai Prashanth Pathi

doi:10.63282/3050-9246.IJETCSIT-V7I1P121

Authors

Sai Prashanth Pathi, Independent Researcher, USA.

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V7I1P121

Keywords:

Machine Learning, Schema Matching, Data Lake, Automated Onboarding, K-Nearest Neighbors, Metadata Management, Gap Analysis, Enterprise Information Integration

Abstract

In large-scale enterprise environments, onboarding disparate asset teams to a centralized Customer Data Lake (CDL) or a similar enterprise data repository is often bottlenecked by manual gap analysis. This process compares legacy asset tables against the central lake to identify schema overlaps and missing information, a task complicated by non-standardized column naming and large data volumes. This paper proposes an automated schema-matching framework built on a hybrid approach. A metadata module scores column names using Jaro-Winkler string similarity together with a domain-specific abbreviation dictionary, and a data content module applies K-Nearest Neighbors (KNN) classification to feature-engineered embeddings. A feedback loop lets the data content model iteratively improve the metadata dictionary. Experimental results show an F1 score of approximately 90.7%, indicating that the framework can substantially reduce manual mapping effort and streamline the onboarding process.
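As an illustration of the metadata module described above, the following is a minimal Python sketch of Jaro-Winkler name matching preceded by abbreviation expansion. It is not the paper's implementation: the dictionary entries, the function names (normalize, best_match), the tokenization rule, and the conventional prefix scale of 0.1 are all assumptions made for this example. The KNN content module and the feedback loop are not shown.

```python
import re

# Hypothetical abbreviation dictionary; the paper's actual entries are not given.
ABBREVIATIONS = {"cust": "customer", "addr": "address", "nm": "name", "dt": "date"}


def normalize(column: str) -> str:
    """Lowercase a column name, split on non-alphanumerics, expand abbreviations."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", column.lower()) if t]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def best_match(source_column: str, lake_columns: list[str]) -> tuple[str, float]:
    """Return the lake column whose normalized name scores highest."""
    source = normalize(source_column)
    score, column = max((jaro_winkler(source, normalize(c)), c) for c in lake_columns)
    return column, score


if __name__ == "__main__":
    print(best_match("CUST_ADDR", ["customer_address", "order_date"]))
```

In this sketch, expanding abbreviations before scoring is what lets a legacy name such as CUST_ADDR line up with customer_address; a threshold on the returned score (not specified here) would decide whether a match is accepted or passed to manual review.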
