Synthetic Test Data Generation Using Generative Models
DOI:
https://doi.org/10.63282/3050-9246.IJETCSIT-V4I4P111
Keywords:
Synthetic Data, Generative Adversarial Networks, Variational Autoencoders, Diffusion Models, Software Testing, Data Privacy, Test Data Generation, Differential Privacy
Abstract
Demand for high-quality, privacy-compliant, and scalable test data has grown sharply as AI-based applications and software testing requirements have expanded. Traditional data collection techniques suffer from privacy risks, inadequate coverage of edge cases, and high manual effort. Synthetic data generation via generative models has emerged as a viable solution to these challenges. This paper investigates how recent advances in generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models, can be used to create synthetic test datasets that preserve statistical fidelity while protecting user privacy. It describes the architectural elements, training procedures, and validation techniques employed in building such models, with particular attention to maintaining data diversity and realism. The experimental findings indicate that modern generative models can produce synthetic data that closely resembles the real-world distribution and can substantially increase software test coverage, especially for edge cases and in compliance-sensitive domains such as finance and healthcare. Moreover, integrating differential privacy mechanisms demonstrates the feasibility of regulation-compliant and secure synthetic data pipelines. The paper highlights the advantages, challenges, and potential applications of generative models in synthetic data generation. The findings suggest that hybrid methods, which combine synthetic data with minimally obfuscated real data, are the most effective way to balance realism, privacy, and practical usefulness in real-world testing situations.