Just retrieve the file-urls from an HDFS directory - possible?

asked 2018-10-18 04:35:31 -0600

pwel

updated 2018-10-18 06:45:28 -0600


Currently, I use the HadoopStandalone origin just to identify existing files in an AzureBlobStore container, which generally works fine. I defined a WholeFile transfer into a Trash destination and use the events from the HadoopSA origin to enrich the identified files and forward them to Kafka.

Unfortunately, the pipeline takes more than 5 minutes just to start up when the origin contains, e.g., 2000 files (spooling starts after 2 minutes and takes another 3-4 minutes). Later on I need to load several hundred thousand files as an initial load for other pipelines. I fear this might take hours or days just to start, and I cannot stop a pipeline while it is starting, so I hesitate to test this.

Is there a way to read the listing much faster using only StreamSets features? For comparison, listing 6374 files with PowerShell takes 5 seconds.
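
To put rough numbers on the concern, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted above and rests on two assumptions that are not confirmed anywhere: that startup time grows linearly with the number of files, and that the initial load is about 300,000 files (a placeholder for "several hundred thousand").

```python
# Figures quoted in the post.
ORIGIN_FILES = 2000        # files present when the pipeline was started
ORIGIN_SECONDS = 5 * 60    # "more than 5 minutes", used here as a lower bound
POWERSHELL_FILES = 6374    # files listed with PowerShell
POWERSHELL_SECONDS = 5     # time the PowerShell listing took

# Assumption: placeholder count for "several hundred thousand" files.
ASSUMED_INITIAL_LOAD = 300_000


def files_per_second(files: int, seconds: float) -> float:
    """Average listing throughput."""
    return files / seconds


def estimated_hours(files: int, rate: float) -> float:
    """Time to list `files` at `rate` files/s, assuming linear scaling."""
    return files / rate / 3600


origin_rate = files_per_second(ORIGIN_FILES, ORIGIN_SECONDS)
powershell_rate = files_per_second(POWERSHELL_FILES, POWERSHELL_SECONDS)

print(f"origin startup:  {origin_rate:.1f} files/s")
print(f"PowerShell list: {powershell_rate:.1f} files/s")
print(f"estimated startup for {ASSUMED_INITIAL_LOAD} files: "
      f"{estimated_hours(ASSUMED_INITIAL_LOAD, origin_rate):.1f} h")
```

Under these assumptions the origin handles roughly 7 files per second during startup, against well over a thousand per second for the plain listing, so an initial load of that size would need on the order of half a day before the first record is processed. If startup time grows faster than linearly, the estimate is too optimistic.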

Thanks a lot for your help in advance! Peter
