Extracting incremental data from API
Hi All,
I am new to StreamSets. I am in the process of developing a pipeline to extract JSON data from a REST API. We have to land data in a data lake (AWS S3 in JSON format) first and from there data will be loaded in Redshift.
I am trying to extract Incremental data from the API so that the API call will only extract data modified after maximum modified date captured in previous run. For example if, for user API resource, last pipeline has extracted data modified till 2019-01-21 7:00:00, next pipeline run should try to fetch data modified after 2019-01-21 7:00:00.
The API supports this kind of get request: GET /1/healthscores?modifiedfrom=2019-01-21 7:00:00
.
I have defined this as parameter for now and embedded in URL
URL - ${URL}/${table_name}?${filter}=${filter_offset}
Parameter value -
${URL} = https://api.xxxxx.aon.com/1
${table_name} = users
${filter} = modifiedfrom
${filter_offset} = 2019-01-21 7:00:00
Currently I have hard coded ${filter_offset}
parameter.
My question is how do I set {filter_offset}
parameter from Max(modifiedfrom) from last ETL run for every API resource (in this case resource is users). I can store data in Redshift staging table after data lake is there any way that I can set {filter_offset}
parameter with MAX( modifiedfrom) from stg_user table.
Hi! Are you using HTTP Client origin?
Yes I am using HTTP Client origin
Have already checked out different pagination options in the docs https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Origins/HTTPClient.html?hl=pagination?