Automated Data Lake Onboarding: A Hybrid Metadata and Content-Based Approach for Schema Matching
DOI:
https://doi.org/10.63282/3050-9246.IJETCSIT-V7I1P121
Keywords:
Machine Learning, Schema Matching, Data Lake, Automated Onboarding, K-Nearest Neighbors, Metadata Management, Gap Analysis, Enterprise Information Integration
Abstract
In large-scale enterprise environments, onboarding disparate asset teams to a centralized Customer Data Lake (CDL) or similar enterprise data repository is often bottlenecked by manual Gap Analysis. This process involves comparing legacy asset tables with the central lake to identify schema overlaps and missing information, a task complicated by non-standardized column naming and vast data volumes. This paper proposes an automated framework for schema matching that utilizes a hybrid approach. We employ a metadata module using Jaro-Winkler string similarity with a domain-specific abbreviation dictionary, alongside a data content module utilizing K-Nearest Neighbors (KNN) classification on feature-engineered embeddings. A unique feedback loop allows the data model to iteratively improve the metadata dictionary. Experimental results demonstrate an F1 score of approximately 90.7%, significantly reducing manual mapping efforts and streamlining the onboarding process.
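As an illustration of the metadata module described in the abstract, the following minimal sketch expands abbreviations in column names and then scores candidate pairs with Jaro-Winkler similarity. It is not the authors' implementation: the dictionary entries, the `normalize` and `best_match` helpers, the underscore-based tokenization, and the 0.9 acceptance threshold are all assumptions made for the example. The KNN content module is not shown.

```python
# Hypothetical abbreviation dictionary; a real one would be domain-specific.
ABBREVIATIONS = {"cust": "customer", "addr": "address", "nbr": "number", "dt": "date"}


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def normalize(column: str, abbreviations: dict) -> str:
    """Lowercase, split on underscores or hyphens, and expand known abbreviations."""
    tokens = column.lower().replace("-", "_").split("_")
    return " ".join(abbreviations.get(t, t) for t in tokens if t)


def best_match(source_col, lake_cols, abbreviations, threshold=0.9):
    """Return (lake column, score), or (None, score) if below the assumed threshold."""
    src = normalize(source_col, abbreviations)
    score, col = max(
        (jaro_winkler(src, normalize(c, abbreviations)), c) for c in lake_cols
    )
    return (col, score) if score >= threshold else (None, score)
```

Under these assumptions, `best_match("cust_addr", ["customer_address", "order_date"], ABBREVIATIONS)` pairs the legacy column with `customer_address`, because both names normalize to the same string. Columns that fall below the threshold would be the natural candidates to hand to a content-based matcher.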
