Problems downloading a file every 60 seconds by polling or scheduling

asked 2019-03-01 06:44:48 -0600

frank

Hi all,

I want to download an XML file every 60 seconds, extract some data, and insert that data after normalization into a PostgreSQL database. The file is automatically changed by the web server every 60 seconds. I set up the pipeline HTTP client --> ... some data transformation ... --> JDBC lookup --> JDBC producer, which works fine for a single run, but I struggle to run it every 60 seconds.

I tried the following approaches:

  1. I configured a polling interval of 60 seconds in the HTTP client. Unfortunately, this interval is the time waited after the last request has completed, and the request itself takes a non-negligible amount of time, so the effective period is longer than 60 seconds. I observed missing data every sixth run. The entire pipeline takes 90 seconds for the first run, and all subsequent runs take about 20 seconds, because the cache of the JDBC lookup is filled during the first run and reused in all subsequent runs.
  2. I tried to use the REST API to start the pipeline every 60 seconds. Unfortunately, running the pipeline once this way takes approx. 90 seconds, so I was unable to start it again after 60 seconds. In this setting the cache is filled on every run but cannot be reused in subsequent runs, since each of these is a "new run".
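
To make the timing problem in approach 1 concrete, here is a minimal Python sketch. It is not part of the pipeline and does not use any StreamSets API. The function names and the 10-second request duration are assumptions chosen for illustration only. It contrasts waiting 60 seconds after each request completes (fixed delay) with starting a request every 60 seconds regardless of how long the request takes (fixed rate).

```python
import time


def fixed_delay_starts(period, durations):
    """Start times when the wait begins only after each request completes."""
    starts = []
    t = 0.0
    for duration in durations:
        starts.append(t)
        # The next request starts `period` seconds after this one finishes.
        t += duration + period
    return starts


def fixed_rate_starts(period, count):
    """Start times when requests are aligned to a fixed schedule."""
    return [i * period for i in range(count)]


def run_at_fixed_rate(task, period, count,
                      clock=time.monotonic, sleep=time.sleep):
    """Call `task` `count` times, aiming for one start every `period` seconds.

    The sleep is shortened by however long `task` took, so the schedule
    does not drift as long as `task` finishes within `period`.
    """
    start = clock()
    for i in range(count):
        task()
        next_due = start + (i + 1) * period
        remaining = next_due - clock()
        if remaining > 0:
            sleep(remaining)


if __name__ == "__main__":
    # Assumed request duration of 10 s: the fixed-delay schedule falls
    # 10 s further behind the 60 s file updates with every request.
    print(fixed_delay_starts(60, [10, 10, 10]))
    print(fixed_rate_starts(60, 3))
```

With a fixed delay, each cycle lasts 60 seconds plus the request time, so the poller gradually falls behind the 60-second file updates and eventually skips one. What I am looking for is the fixed-rate behaviour, ideally without losing the lookup cache between runs.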

How can I overcome these difficulties?
