
Directory origin starts processing file before it is fully copied

asked 2020-10-03 06:53:55 -0500

Kunjutti

I have a Directory origin that reads records from a CSV file and writes them to a database. The problem: if I copy a file into that directory while the pipeline is running, StreamSets starts processing the file before it is fully copied and misses some records. Because the offset is already recorded for that file, it is never picked up again, and many of its records remain unprocessed. I hit this with files containing a very large number of records, copied while the pipeline was running.


1 Answer


answered 2020-10-06 12:26:00 -0500

Tim

Because a copy is not transactional in a file system, you must manage the transaction yourself. Change the way files are placed in the directory being read. If your file system and method of access support it, move the file into the target directory instead of copying it; a move within the same file system is a rename, so the file appears in a single transaction.
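As a rough illustration of the "move instead of copy" approach, here is a minimal Python sketch. The function name `move_into_watched_dir` and its arguments are hypothetical, and it assumes a POSIX-style file system where a rename within one file system is atomic. It refuses to move across file systems, because there a move falls back to copy plus delete and the partial-file problem comes back.

```python
import os

def move_into_watched_dir(src_path, watched_dir):
    """Move src_path into watched_dir, but only as a same-file-system rename."""
    # A rename is a single step only within one file system. Across file
    # systems a "move" is really copy + delete, so bail out in that case.
    if os.stat(src_path).st_dev != os.stat(watched_dir).st_dev:
        raise OSError("different file systems: stage and rename instead of moving")
    dest_path = os.path.join(watched_dir, os.path.basename(src_path))
    # os.replace renames the file; an existing file with the same name is overwritten.
    os.replace(src_path, dest_path)
    return dest_path
```

The device check is only a heuristic for "same file system"; if it fails, the staging approach is the safer route.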

If you have no choice but to copy the file (e.g. when uploading through SFTP), copy it into a staging location or under a temporary name first. Then rename the file, so that its appearance is again a single transaction, and touch it to update its timestamp.
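The copy, rename, touch sequence could look something like this in Python. This is a sketch, not a definitive implementation: `deliver_file`, the leading dot, and the `.tmp` suffix are made-up names, and it assumes the Directory origin's file name pattern (for example `*.csv`) does not match the temporary name, so the origin ignores the file while it is still being written.

```python
import os
import shutil

def deliver_file(src_path, watched_dir, tmp_suffix=".tmp"):
    """Copy src_path into watched_dir under a temporary name, then rename it."""
    name = os.path.basename(src_path)
    final_path = os.path.join(watched_dir, name)
    # Hypothetical temporary name, e.g. ".data.csv.tmp"; it lives in the same
    # directory as the final file so the rename stays within one file system.
    tmp_path = os.path.join(watched_dir, "." + name + tmp_suffix)
    # The slow part: the file is incomplete during this call, but it is
    # still hidden behind the temporary name.
    shutil.copyfile(src_path, tmp_path)
    # The file appears under its real name in a single step.
    os.replace(tmp_path, final_path)
    # Equivalent of `touch`: set access and modification time to now.
    os.utime(final_path, None)
    return final_path
```

If the copy fails halfway, only the temporary file is left behind, and it can be deleted and retried without the pipeline ever having seen it.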

This is required to work within the constraints of a file system. A partial copy can also result from a network error or disconnection during transmission. You must deliver the data to the directory StreamSets reads in a way that avoids locked or partially delivered files.



Agreed. I used rsync to copy the files to the directory, and the issue was resolved.

Kunjutti (2020-10-08 06:27:47 -0500)
