Automated Data Cleaning: AI Methods for Enhancing Data Quality and Consistency

Authors

  • Mr. Rahul Cherekar, Software Development Manager, Chewy, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V5I1P104

Keywords:

Automated Data Cleaning, Artificial Intelligence, Machine Learning, Data Quality, Consistency, Data Preprocessing, Error Detection, Data Imputation

Abstract

Data cleaning is a preprocessing step that detects and corrects errors in order to improve data quality. Data quality deserves particular attention because poor-quality data leads analyses to wrong conclusions. Applying AI techniques to automatically detect and clean errors in big data is among the most efficient and rapid ways to achieve high data reliability. This paper analyses four families of AI-based data cleaning methods: rule-based methods, machine learning, deep learning, and hybrids that combine the three. Each has strengths that suit it to particular applications and weaknesses that should be considered before implementation. We also present practical examples from healthcare, finance, and business intelligence that illustrate the effectiveness of AI-driven data cleaning. The experimental results indicate that all four approaches improve data quality and reduce manual effort. The paper critically discusses the latest AI techniques for data cleaning to guide researchers and practitioners.

Published

2024-03-31

Section

Articles

How to Cite

1. Cherekar R. Automated Data Cleaning: AI Methods for Enhancing Data Quality and Consistency. IJETCSIT [Internet]. 2024 Mar. 31 [cited 2025 Sep. 14];5(1):31-40. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/151
