Performance issue while executing multiple pipelines writing to HDFS

Hello,

In our production environment, we have set up five StreamSets instances (version 3.4.0) on different hosts and implemented pipelines that consume data from Kafka, perform transformations, and export the results to HDFS. Our production pipelines run on three of the StreamSets instances, while the remaining two are kept for failover. On each production instance, we have set the heap size to 150 GB and runner.thread.pool.size to 750.
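For completeness, these are the two settings mentioned above as we have them configured. The file locations assume a default Data Collector install layout; adjust to yours:

```
# etc/sdc.properties — runner thread pool shared by all pipelines on the instance
runner.thread.pool.size=750

# libexec/sdc-env.sh — JVM heap for the Data Collector process
export SDC_JAVA_OPTS="-Xmx150g -Xms150g ${SDC_JAVA_OPTS}"
```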

We have observed that when all production pipelines are running (around 70 in total), any additional pipelines we start get stuck in the "STARTING" state for several minutes. Moreover, the throughput of several running production pipelines occasionally drops very low.

We have monitored the threads of the pipelines stuck in the "STARTING" state using Java VisualVM, and they appear to be blocked while trying to authenticate to HDFS, as shown in the stack trace below.

Additionally, when we run the stuck pipelines on the failover StreamSets instances instead, they work properly and achieve high throughput.

We would like to ask whether there is a configuration change we can make, or anything else we should investigate, to resolve this performance issue.

2019-10-01 17:38:13

- waiting to lock <38b6eff4> (a javax.security.auth.Subject) owned by "ProductionPipelineRunnable-MZBSGN2ef152daa-70d0-4959-a72d-46c7f211db7e-MZ_BS_GN_2" t@1192
at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:209)
at org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:205)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:408)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:401)
at com.streamsets.pipeline.stage.destination.hdfs.writer ...
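From the trace, our working theory is that many runner threads share a single Kerberos login Subject, and the Hadoop client code synchronizes on it, so concurrent authentication to the KMS/NameNode serializes across pipelines. The following is a minimal standalone sketch of that contention pattern only, not StreamSets or Hadoop code; the class name, thread count, and sleep duration are illustrative assumptions:

```java
import javax.security.auth.Subject;

public class SubjectLockDemo {
    // Runs nThreads that each hold the monitor of one shared Subject for
    // sleepMs (a stand-in for a KMS/NameNode authentication round trip)
    // and returns the total elapsed wall-clock time in milliseconds.
    static long runSerialized(int nThreads, long sleepMs) throws InterruptedException {
        final Subject shared = new Subject(); // one Subject shared by all threads
        Thread[] workers = new Thread[nThreads];
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                synchronized (shared) { // every thread queues on the same monitor
                    try {
                        Thread.sleep(sleepMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // With 4 threads at 200 ms each, the work serializes to roughly
        // 4 x 200 ms instead of running in parallel.
        System.out.println("elapsed ms: " + runSerialized(4, 200));
    }
}
```

If this is indeed what is happening, the per-thread cost of each authentication call would multiply by the number of runner threads queuing on the lock, which would match both the slow "STARTING" state and the throughput drops we see when many pipelines run on one instance.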