Can the Directory origin read only the new files in a directory (files are written to the directory at regular intervals), or does it read all files present in the directory every time?

E.g., suppose we have 10 files and the origin reads them based on the Last Modified Timestamp. After a minute, a file is added to the specified directory. Does StreamSets import only the newly added file, or all the files in the directory (11 in our case)?

1 Answer

Yes, it is possible to read only the new files written to the specified directory. In your example, the first pipeline run fetches the existing 10 files. Then, when you add a new file, it fetches only the newly added file, based on the timestamp.

There were 23 files in the directory initially:

image description

When a new file is added to the same directory while the pipeline is still running, the count increases by 1, from 23 to 24:

image description

After the pipeline was stopped, a new file was added to the directory and the pipeline was run a second time; the count is 1. This means that the second run fetches only the new files.

image description
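The behaviour described above can be pictured with a small Python sketch. This is only a simplified model, not StreamSets code: it assumes the origin remembers the last file it processed as an offset and, on the next run, picks up only files that sort after that offset by last-modified time. The function name `files_to_process` and the file names and timestamps are hypothetical.

```python
def files_to_process(files, last_offset=None):
    """Return (mtime, name) pairs that sort after the stored offset.

    files: dict mapping file name -> last-modified time (hypothetical values).
    last_offset: (mtime, name) of the last processed file, or None on a first run.
    """
    ordered = sorted((mtime, name) for name, mtime in files.items())
    if last_offset is None:
        return ordered
    return [entry for entry in ordered if entry > last_offset]


# First run: 10 files already exist in the directory.
directory = {f"file_{i:02d}.csv": 1000 + i for i in range(10)}
first_run = files_to_process(directory)
offset = first_run[-1]  # remember the last processed file

# A newer file arrives, then the pipeline runs again.
directory["file_10.csv"] = 2000
second_run = files_to_process(directory, offset)

print(len(first_run))                     # 10
print([name for _, name in second_run])   # ['file_10.csv']
```

In this model the second run sees 11 files in the directory but returns only the one that is newer than the stored offset, which matches the counts reported above.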

OK, it works. Unfortunately, when the pipeline runs I get the WARN "File cannot be added to the queue:<file_name>; DirectorySpooler; directory-dirspooler-pool-20356-thread-1", then the INFO "sending no-more-data event. records 202619 errors 0 files 1", and then the INFO:"sending no-more-data event.

I have already read the documentation, but unfortunately I could not find a solution to this problem.

It's likely a file ordering problem. For example, if you are using lexicographical ordering and you've already processed 005.csv, the pipeline will not process 004.csv, even if it is newer. The same applies to last-modified ordering if you drop in a file that is older than the last one processed.
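To make the ordering issue concrete, here is a rough Python sketch of the lexicographic case. It is an illustration under assumptions, not the DirectorySpooler implementation: it assumes a file is queued only if its name sorts strictly after the last processed name. The helper `split_by_offset` is hypothetical.

```python
def split_by_offset(names, last_processed):
    """Split file names into (queued, skipped) relative to the last processed name.

    Assumption: with lexicographic ordering, only names that compare
    greater than last_processed as plain strings are picked up.
    """
    queued = [n for n in sorted(names) if n > last_processed]
    skipped = [n for n in sorted(names) if n <= last_processed]
    return queued, skipped


# 005.csv was already processed; two more files are dropped in afterwards.
queued, skipped = split_by_offset(["004.csv", "006.csv"], "005.csv")
print(queued)   # ['006.csv']
print(skipped)  # ['004.csv']

# Names without zero padding compare as strings, so "10.csv" sorts before "9.csv".
queued_unpadded, skipped_unpadded = split_by_offset(["10.csv"], "9.csv")
print(skipped_unpadded)  # ['10.csv']
```

If this model matches what is happening, naming files so that each new file sorts after all earlier ones (for example, zero-padded sequence numbers) would avoid the skipped files.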

