Directory origin starts processing file before it is fully copied

asked 2020-10-03

Kunjutti

I have a 'Directory' origin that reads records from a CSV file and writes them to a database. The problem is that if I copy a file into that directory while the pipeline is running, StreamSets starts processing the file before it is fully copied and leaves out some records. Since the offset is already marked for that file, it never gets picked up again, and many records in the file remain unprocessed. I encountered this with files containing a huge number of records, copied while the pipeline was running.

1 Answer

answered 2020-10-06

Tim

Because a copy is not transactional in a file system, you must manage the transaction yourself. You should change the way files are placed in the directory being read. If your file system and method of access support it, move the file into the target directory instead of copying it. A move within the same file system is a rename, so the file appears in a single transaction.
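To illustrate the move approach, here is a minimal Python sketch. The function name and the paths are hypothetical and not part of StreamSets. It assumes the file has already been fully written somewhere on the same file system as the watched directory, and it refuses to proceed otherwise, because a move across file systems falls back to a copy and is no longer a single step.

```python
import os


def move_into_watched_dir(src, watched_dir):
    """Move a fully written file into the watched directory via a rename.

    Hypothetical helper: raises if src and watched_dir are on different
    file systems, since the rename would then not be atomic.
    """
    # st_dev identifies the device (file system) that holds each path.
    if os.stat(src).st_dev != os.stat(watched_dir).st_dev:
        raise OSError("source and watched directory are on different file systems")

    dest = os.path.join(watched_dir, os.path.basename(src))
    # os.replace performs a rename; on POSIX this is atomic within one
    # file system, so a reader sees either no file or the whole file.
    os.replace(src, dest)
    return dest
```

A call might look like `move_into_watched_dir("/data/staging/orders.csv", "/data/incoming")`, where both directories are placeholders for your own layout.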

If you have no choice but to copy the file (e.g. when uploading through SFTP), copy it into a staging location or under a temporary name first. Once the copy is complete, rename the file so that its appearance is again a single transaction, and touch it to update the date/timestamp.
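The copy, rename, and touch sequence could be sketched in Python as follows. This is only an illustration: the function name and the `.tmp` suffix are hypothetical, and it assumes the origin's file name pattern (for example `*.csv`) does not match the temporary name, so the partially written file is ignored until the rename.

```python
import os
import shutil


def deliver_by_copy(src, watched_dir, tmp_suffix=".tmp"):
    """Copy src into watched_dir under a temporary name, then rename it.

    Assumption: the reader's file name pattern does not match names
    ending in tmp_suffix, so the partial file is never picked up.
    """
    final_path = os.path.join(watched_dir, os.path.basename(src))
    tmp_path = final_path + tmp_suffix

    # Step 1: copy under the temporary name and flush the data to disk.
    with open(src, "rb") as fin, open(tmp_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())

    # Step 2: rename within the same directory, so the complete file
    # appears under its final name in a single step.
    os.replace(tmp_path, final_path)

    # Step 3: "touch" the file so its timestamp reflects the delivery time.
    os.utime(final_path, None)
    return final_path
```

Keeping the temporary file in the same directory as the final name guarantees both are on the same file system, which is what makes the rename a single step.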

This is required to work within the constraints of a file system. A partial copy can also occur due to a network error or disconnection during transmission. You must deliver the data being picked up by StreamSets in a way that avoids a locked or partially delivered file.

Agreed. I used rsync to copy the files to the directory, and the issue was resolved.

Kunjutti ( 2020-10-08 )
Asked: 2020-10-03 06:53:55 -0500

