
Are pipeline components idle while a batch is being processed?

asked 2020-03-30 23:00:09 -0500

shixinbao

[Image: screenshot of the pipeline]

In NiFi, once the data source component has read a batch of data, it passes that batch to the next component and immediately starts reading the next batch. Judging from the monitoring screen, StreamSets does not seem to work this way: after reading a batch, the data source waits for the other components in the pipeline to finish processing that batch, and only then reads the next one. If I'm right, why doesn't StreamSets use a NiFi-like processing scheme?


2 Answers


answered 2020-04-24 16:56:59 -0500

jeff

The reason for the behavior you're describing is that Data Collector tracks offsets for the origin. The offset for a batch can't be safely committed until that batch has been completely flushed (i.e. fully written) by the destination(s), so the origin does not move on to the next batch before then.

That being said, there are likely several steps you can take to make the pipeline in your screenshot more efficient. First, the Jython Evaluator (like any scripting processor) is quite slow, so if any of its logic can be replaced by native processors, the pipeline should run much faster. Also, the JDBC Multitable Consumer origin supports multithreaded partition processing, provided the underlying table has a suitable structure (a single numeric offset or key column).
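To illustrate the offset-tracking point at the start of this answer, here is a minimal Python sketch. It is not Data Collector code: `OffsetStore`, `run_pipeline`, and the `write` callback are hypothetical names made up for the illustration, and the "source" is just an in-memory list. The idea it shows is that the offset only advances after the destination has written the whole batch, so a failed write leads to the same batch being read again on restart.

```python
from typing import Callable, List, Sequence


class OffsetStore:
    """Hypothetical stand-in for wherever the committed offset is kept."""

    def __init__(self) -> None:
        self.committed = 0


def run_pipeline(
    source: Sequence[int],
    batch_size: int,
    processors: List[Callable[[List[int]], List[int]]],
    write: Callable[[List[int]], None],
    store: OffsetStore,
) -> None:
    """Process one batch at a time; commit the offset only after the write."""
    while store.committed < len(source):
        start = store.committed
        read = list(source[start:start + batch_size])

        # Every processor sees the whole batch before the next one starts.
        batch = read
        for process in processors:
            batch = process(batch)

        # If this raises, the offset below is never advanced, so the same
        # records are read again the next time the pipeline runs.
        write(batch)

        # Commit by the number of records read, not the number written,
        # because a processor may have filtered some records out.
        store.committed = start + len(read)
```

If `write` raises partway through a run, `store.committed` still points at the start of the failed batch, and calling `run_pipeline` again with the same store resumes from there. Letting the origin read ahead while earlier batches are still in flight would make this bookkeeping harder, which is the trade-off described above.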


Comments

Thank you for your answer. What are the advantages and disadvantages of this approach compared with NiFi's way of processing?

shixinbao (2020-04-28 21:54:31 -0500)

answered 2020-04-23 01:45:28 -0500

shixinbao

For example: suppose a batch contains 1000 records, component A in the pipeline takes 5 seconds to process it, and component B takes 10 seconds. Processing those 1000 records through the pipeline then takes at least 15 seconds, which is very inefficient.
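The arithmetic can be written out as a small Python sketch. It is an idealized model, not a measurement of either product: it assumes every batch takes exactly the same time in each component, that the overlapped case has unlimited queue space between components, and the function names are made up for the illustration.

```python
from typing import Sequence


def batch_at_a_time_seconds(stage_seconds: Sequence[float], batches: int) -> float:
    """Total time when the source waits for each batch to clear every stage."""
    return batches * sum(stage_seconds)


def overlapped_seconds(stage_seconds: Sequence[float], batches: int) -> float:
    """Total time when stages work on different batches at the same time.

    The first batch still has to pass through every stage; after that,
    one batch completes each time the slowest stage finishes.
    """
    if batches == 0:
        return 0
    return sum(stage_seconds) + (batches - 1) * max(stage_seconds)


stages = [5, 10]  # component A: 5 s, component B: 10 s per batch
print(batch_at_a_time_seconds(stages, 1), overlapped_seconds(stages, 1))    # 15 15
print(batch_at_a_time_seconds(stages, 10), overlapped_seconds(stages, 10))  # 150 105
```

Under these assumptions a single batch takes 15 seconds either way. The difference only appears across many batches, where the overlapped scheme is limited by the slowest component (10 seconds per batch here) rather than by the sum of all components.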



Last updated: Apr 24