Batch size for Google Cloud Storage, JDBC Consumer

Hello All,

New to StreamSets :)

We are trying a use case that requires reading data from a database and writing it to a single object in Google Cloud Storage. For this, we built a pipeline with JDBC Consumer as the origin stage and Google Cloud Storage as the destination stage. Because we want one object/file in the GCS bucket containing all the data, we set the origin's Max Batch Size to an exceptionally high value (999999999), hoping that all the data (expected to be fewer than 1 million records most of the time) would be pulled in one batch and therefore written to GCS as a single object. However, we noticed that each batch picks up at most 100K records. So if the SQL query returns 300,001 records, 4 batches are executed before the no-more-data event brings the pipeline to completion.
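To make the batch arithmetic concrete, here is a minimal Python sketch of what we observe. It is not StreamSets code; `batch_count` is a hypothetical helper, and the 100,000 cap is simply the effective batch size we see, whatever its cause.

```python
def batch_count(total_records: int, effective_batch_size: int) -> int:
    """Number of batches needed when each batch holds at most effective_batch_size records."""
    # Ceiling division using integers only, to avoid float rounding.
    return -(-total_records // effective_batch_size)


# 300,001 records with an observed cap of 100,000 per batch:
# three full batches plus one batch holding the single remaining record.
print(batch_count(300_001, 100_000))
```

With one GCS object written per batch, this run would leave four objects in the bucket instead of the one we want.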

1) Is there a way to force a single batch regardless of the number of records pulled from the origin? That way the pipeline would create just 1 file in GCS per execution.

2) According to the documentation, the GCS object name is written in the format <prefix>_<uuid>.<optional suffix>. Is there a way to append a date to the prefix, in a format like YYYY_MM_DD? E.g. mytestfile_2019_09_25_<uuid>.txt
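For clarity, this is the naming result we are after, sketched in plain Python. It only illustrates the desired output; `desired_object_name` is a hypothetical helper and says nothing about how the GCS destination actually builds names.

```python
import uuid
from datetime import date


def desired_object_name(prefix: str, run_date: date, suffix: str = "txt") -> str:
    """Build a name shaped like <prefix>_<YYYY_MM_DD>_<uuid>.<suffix>."""
    date_part = run_date.strftime("%Y_%m_%d")
    # uuid4() stands in for the random UUID that the destination generates.
    return f"{prefix}_{date_part}_{uuid.uuid4()}.{suffix}"


# Produces something like mytestfile_2019_09_25_<uuid>.txt
print(desired_object_name("mytestfile", date(2019, 9, 25)))
```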

Thanks for your help in advance!