Automated Data Lake Onboarding: A Hybrid Metadata and Content-Based Approach for Schema Matching

Authors

  • Sai Prashanth Pathi, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V7I1P121

Keywords:

Machine Learning, Schema Matching, Data Lake, Automated Onboarding, K-Nearest Neighbors, Metadata Management, Gap Analysis, Enterprise Information Integration

Abstract

In large-scale enterprise environments, onboarding disparate asset teams to a centralized Customer Data Lake (CDL) or similar enterprise data repository is often bottlenecked by manual Gap Analysis. This process compares legacy asset tables with the central lake to identify schema overlaps and missing information, a task complicated by non-standardized column naming and vast data volumes. This paper proposes an automated schema-matching framework that takes a hybrid approach: a metadata module applies Jaro-Winkler string similarity together with a domain-specific abbreviation dictionary, and a data content module applies K-Nearest Neighbors (KNN) classification to feature-engineered embeddings. A feedback loop lets the data content model iteratively improve the metadata dictionary. Experimental results show an F1 score of approximately 90.7%, reducing manual mapping effort and streamlining the onboarding process.
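The metadata module described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abbreviation entries, the 0.9 acceptance threshold, and the function names (normalize, match_columns) are assumptions chosen for illustration, and Jaro-Winkler is written out by hand so the example needs only the standard library.

```python
import re

# Hypothetical domain abbreviation dictionary (illustrative entries only).
ABBREVIATIONS = {"cust": "customer", "addr": "address", "acct": "account",
                 "nbr": "number", "dt": "date", "amt": "amount"}


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character within the sliding window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def normalize(column_name: str) -> str:
    """Lowercase, tokenize, and expand known abbreviations."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)


def match_columns(source_cols, lake_cols, threshold=0.9):
    """Map each source column to its best lake column, or None (a gap)."""
    result = {}
    for src in source_cols:
        best = max(lake_cols,
                   key=lambda c: jaro_winkler(normalize(src), normalize(c)))
        score = jaro_winkler(normalize(src), normalize(best))
        result[src] = best if score >= threshold else None  # assumed cutoff
    return result


matches = match_columns(["CUST_ADDR", "ACCT_NBR", "xyz_flag"],
                        ["customer_address", "account_number", "order_date"])
```

Columns that map to None in this sketch correspond to the "missing information" side of Gap Analysis; in the hybrid framework, such unresolved names would be candidates for the content-based KNN module rather than a final verdict.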

Published

2026-02-12

Section

Articles

How to Cite

1.
Pathi SP. Automated Data Lake Onboarding: A Hybrid Metadata and Content-Based Approach for Schema Matching. IJETCSIT [Internet]. 2026 Feb. 12 [cited 2026 Feb. 26];7(1):146-9. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/590