
Frequently reading from an FTP/SFTP origin

asked 2017-10-25 08:27:19 -0500

Anonymous

My pipeline currently reads files from an FTP origin, performs transformations, and writes them to a JDBC Producer destination. The problem with the FTP origin, however, is that it only works for the files that are in the specified directory at the time the pipeline starts running. Neither the FTP origin nor the JDBC Producer seems to give me any events to finish the pipeline properly (so that I could restart it via another pipeline when a change occurs).

Is there a way to watch for new files periodically and feed them into the running pipeline? Maybe via the start event -> a shell script, or a JavaScript Evaluator / ...
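If it came to restarting the pipeline from outside, this is roughly what I had in mind. It is an untested sketch that assumes SDC's default port 18630, default credentials, and the v1 REST start endpoint; the pipeline ID is made up:

    import requests

    SDC_URL = "http://localhost:18630"      # assumed default Data Collector port
    PIPELINE_ID = "myFtpToJdbcPipeline"     # hypothetical pipeline ID
    AUTH = ("admin", "admin")               # default credentials; change in practice

    # SDC rejects POST requests that lack the X-Requested-By header
    resp = requests.post(
        SDC_URL + "/rest/v1/pipeline/" + PIPELINE_ID + "/start",
        headers={"X-Requested-By": "sdc"},
        auth=AUTH,
    )
    resp.raise_for_status()
    print(resp.json())                      # pipeline state as reported by SDC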

Thanks for your help


Comments

The SFTP/FTP client 'watches' the configured FTP directory for changes, so, as long as the pipeline is running, it should 'see' new files. Do you see any errors in sdc.log? Is stopping and restarting the pipeline enough to read the new files, or do you have to reset the offset?

metadaddy ( 2017-10-25 17:14:22 -0500 )

Thank you both for the help so far. I did check the logs, and the file following the first one did get read. The problem, however, is that no records from the second file appear in my pipeline.

msnej ( 2017-10-26 14:54:32 -0500 )

After I changed the data slightly, I am able to load the first file and a small portion of the second one. The only criterion I could find in the docs was the last-modified timestamp, but then shouldn't the complete file get loaded (or not at all), instead of only a few records?

msnej ( 2017-10-26 14:55:48 -0500 )

2 Answers


answered 2017-12-15 13:26:15 -0500

supaxi

updated 2017-12-18 09:08:50 -0500

I'm having a similar problem. If I start the FTP origin and then add new files, SDC says the file was read, but the pipeline just sits there doing nothing. If I restart the pipeline, it works as expected, but I might need to reset the origin if some of the existing files are still in there. Maybe this will be fixed in the next version, when we can move processed files to another directory.

Also, is it possible to get the file timestamp?

edit: I was able to get the timestamp using ${record:attribute('mtime')}
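In case it helps anyone else: the attribute can be copied into a record field, for example with an Expression Evaluator (the output field name here is just illustrative):

    Output Field:     /file_mtime
    Field Expression: ${record:attribute('mtime')}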


Comments

Ok, I found that if you copy over a file that the StreamSets pointer is already monitoring, you will get an error. The solution I came up with is to touch a temp file in the directory before copying, to move the pointer off the file I need to read.
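Roughly like this, as a sketch; the paths are illustrative and assume the origin is watching /data/ftp:

    import pathlib
    import shutil

    ftp_dir = pathlib.Path("/data/ftp")   # directory the SFTP/FTP origin watches

    # Touch a throwaway marker file first, so the origin's pointer moves
    # off the file we are about to overwrite.
    (ftp_dir / ".marker").touch()

    # Now it is safe to copy the real file into place.
    shutil.copy("new_data.csv", ftp_dir / "data.csv")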

supaxi ( 2018-01-09 08:51:00 -0500 )

answered 2017-10-25 17:17:16 -0500

hshreedharan

metadaddy is right. The origin does watch the directory for changes. Please check your logs for errors - that should help with debugging.
