Efficient Bulk File Ingestion into Data Lake using SMB and FTP protocols

Authors

  • Vamshi Krishna Malthummeda, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V7I1P113

Keywords:

SFTP, SMB Protocol, SSH, Databricks, Data Lake, PySpark, Python

Abstract

Companies in the retail industry need to exchange sensitive information such as purchase orders, invoices, payment information, and contracts in order to improve efficiency, reduce costs, and strengthen business relationships. Sensitive information is also exchanged between teams within an organization to support collaboration and efficiency and to extract valuable insights. This paper introduces a cost-effective and secure approach tailored for bulk file ingestion into a Databricks data lake using the SFTP and SMB protocols. The proposed file transfer framework uses the Python packages paramiko (which implements the SSHv2 protocol for secure file transfer over an unsecured network) and pysmb (which implements the SMB protocol for secure file sharing across different operating systems within a network) to transfer files into a Databricks volume, where they are processed further by transformation pipelines running on a Databricks cluster. The methodology supports several types of transfer: synchronous small-file transfer, asynchronous large-file transfer, and continuous file transfer. It also supports deleting files on the remote server after a configured number of successful transfers. The key findings of this implementation are a simplified setup procedure, low development effort, and decent performance. The findings suggest that the proposed framework, with its data integrity and remote file management capabilities, is a secure, simple, versatile, and reliable choice for transferring sensitive files.
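
The abstract does not include the framework's code, so the following is only a minimal sketch of how a synchronous SFTP pull with deletion after a configured number of successful transfers might look. The function names `ingest_once` and `open_sftp`, the JSON state file, and the size-based integrity check are assumptions made for illustration and are not taken from the paper. The sketch relies only on paramiko calls that exist in its public API (`SSHClient.connect`, `SSHClient.open_sftp`, `SFTPClient.listdir_attr`, `SFTPClient.get`, `SFTPClient.remove`).

```python
import json
import stat
from pathlib import Path


def ingest_once(sftp, remote_dir, local_dir, state_path, delete_after=3):
    """Copy every regular file in remote_dir to local_dir (one synchronous pass).

    `sftp` is any object exposing listdir_attr/get/remove in the style of
    paramiko.SFTPClient. A per-file success counter is kept in a JSON file
    (an assumption of this sketch); once a file has been transferred
    `delete_after` times it is removed from the remote server.
    Returns the names of the files transferred in this pass.
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    state_file = Path(state_path)
    counts = json.loads(state_file.read_text()) if state_file.exists() else {}

    transferred = []
    for attr in sftp.listdir_attr(remote_dir):
        # Skip directories, links, and entries without mode information.
        if attr.st_mode is None or not stat.S_ISREG(attr.st_mode):
            continue
        remote_path = f"{remote_dir.rstrip('/')}/{attr.filename}"
        local_path = local_dir / attr.filename
        sftp.get(remote_path, str(local_path))

        # Hypothetical integrity check: compare the local size with the
        # size reported by the server; discard the copy on mismatch.
        if local_path.stat().st_size != attr.st_size:
            local_path.unlink()
            continue

        counts[attr.filename] = counts.get(attr.filename, 0) + 1
        transferred.append(attr.filename)

        # Remote cleanup after the configured number of successful transfers.
        if counts[attr.filename] >= delete_after:
            sftp.remove(remote_path)
            del counts[attr.filename]

    state_file.write_text(json.dumps(counts))
    return transferred


def open_sftp(host, username, password, port=22):
    """Open an SSHv2 session with paramiko and return (ssh_client, sftp_client)."""
    import paramiko  # imported lazily so ingest_once can be exercised without it

    client = paramiko.SSHClient()
    client.load_system_host_keys()  # unknown hosts are rejected by default
    client.connect(host, port=port, username=username, password=password)
    return client, client.open_sftp()
```

In a Databricks job, `local_dir` would typically point at a volume path of the form `/Volumes/<catalog>/<schema>/<volume>/...`, and the pair returned by `open_sftp` would be closed once `ingest_once` returns. The host, credentials, and paths are placeholders to be supplied by the reader. An SMB variant built on pysmb would follow the same shape with a different client object.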



Published

2026-02-04

Section

Articles

How to Cite

Malthummeda VK. Efficient Bulk File Ingestion into Data Lake using SMB and FTP protocols. IJETCSIT [Internet]. 2026 Feb. 4 [cited 2026 Feb. 12];7(1):97-100. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/563
