Extracting incremental data from API

asked 2019-01-21 22:08:49 -0600

suman gravatar image

updated 2019-01-23 16:06:59 -0600

metadaddy gravatar image

Hi All,

I am new to StreamSets. I am in the process of developing a pipeline to extract JSON data from a REST API. We have to land data in a data lake (AWS S3 in JSON format) first and from there data will be loaded in Redshift.

I am trying to extract Incremental data from the API so that the API call will only extract data modified after maximum modified date captured in previous run. For example if, for user API resource, last pipeline has extracted data modified till 2019-01-21 7:00:00, next pipeline run should try to fetch data modified after 2019-01-21 7:00:00.

The API supports this kind of get request: GET /1/healthscores?modifiedfrom=2019-01-21 7:00:00.

I have defined this as parameter for now and embedded in URL

URL - ${URL}/${table_name}?${filter}=${filter_offset} Parameter value -

${URL} = https://api.xxxxx.aon.com/1
${table_name} = users
${filter} = modifiedfrom
${filter_offset} = 2019-01-21 7:00:00

Currently I have hard coded ${filter_offset} parameter.

My question is how do I set {filter_offset} parameter from Max(modifiedfrom) from last ETL run for every API resource (in this case resource is users). I can store data in Redshift staging table after data lake is there any way that I can set {filter_offset} parameter with MAX( modifiedfrom) from stg_user table.

edit retag flag offensive close merge delete

Comments

Hi! Are you using HTTP Client origin?

iamontheinet gravatar imageiamontheinet ( 2019-01-23 15:54:55 -0600 )edit

Yes I am using HTTP Client origin

suman gravatar imagesuman ( 2019-01-29 21:26:09 -0600 )edit
iamontheinet gravatar imageiamontheinet ( 2019-01-29 22:25:08 -0600 )edit