Working with very large files (>100GB)

asked 2018-09-12 07:47:28 -0600

this post is marked as community wiki



I am just starting to evaluate StreamSets. I apologize in advance if this question has been answered before on this forum.

Our data flow use case requires us to ingest data from a number of very large files, upwards of 300GB in some cases. We want to run some basic validations without reading the entire file content, and then push chunks of the file data to Spark for content validation and transformations.
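To make the intended pattern concrete, here is a minimal plain-Python sketch of what I mean by "validate without reading the entire content, then process in chunks". This is not StreamSets code; the function names and the chunk size are my own illustration, and in practice the chunked delivery would feed a Spark job rather than a Python generator:

```python
import csv

def validate_header(path, expected_columns):
    # Read only the first line to check the header, without
    # loading the whole multi-hundred-GB file into memory.
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return header == expected_columns

def iter_chunks(path, chunk_size=100_000):
    # Yield lists of rows so a downstream consumer (e.g. a Spark
    # job) can process the file in bounded-memory chunks.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
```

The key property we need from the pipeline is the same as in this sketch: memory use stays bounded by the chunk size, never by the file size.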

It would be great if anyone could shed light on how StreamSets handles very large files. The files are mostly CSVs. Has any benchmarking been done in this regard? Also, if someone could point me to best practices, that would be awesome.

Many thanks, Ajit
