You could certainly do this at the file level with StreamSets Data Collector (SDC), using the Hadoop FS origin and the Hadoop FS destination. The pipeline would run in Cluster Batch mode, so multiple instances of SDC would run on the cluster, and you would be able to apply transformations to the data in flight if necessary.
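
To make "transformations in flight" a bit more concrete, here is a rough conceptual sketch of the origin -> processor -> destination flow. It is plain Python on the local filesystem, not StreamSets or HDFS code: `copy_with_transform` and `uppercase_non_empty` are hypothetical names invented for illustration, and local directories stand in for the HDFS source and target. In SDC you would configure the equivalent stages in the pipeline rather than write this by hand.

```python
# Conceptual sketch only: plain Python on the local filesystem,
# NOT the StreamSets or HDFS APIs. All names here are hypothetical.
from pathlib import Path
from typing import Callable, Optional


def copy_with_transform(src_dir: str, dst_dir: str,
                        transform: Callable[[str], Optional[str]]) -> int:
    """Copy every file in src_dir to dst_dir, passing each line through transform.

    Returns the number of records written. A transform that returns
    None drops the record instead of writing it.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    # "Origin": iterate over the input files in a stable order.
    for in_path in sorted(p for p in src.iterdir() if p.is_file()):
        out_path = dst / in_path.name
        with in_path.open("r", encoding="utf-8") as fin, \
             out_path.open("w", encoding="utf-8") as fout:
            for line in fin:
                # "Processor": transform one record while it is in flight.
                record = transform(line.rstrip("\n"))
                if record is not None:
                    # "Destination": write the transformed record.
                    fout.write(record + "\n")
                    written += 1
    return written


def uppercase_non_empty(record: str) -> Optional[str]:
    """Hypothetical transformation: drop blank records, upper-case the rest."""
    return record.upper() if record.strip() else None
```

Calling `copy_with_transform("in", "out", uppercase_non_empty)` would copy each file from `in` to `out` with the transformation applied per record. The difference with a Cluster Batch pipeline is that the work is spread across the SDC instances on the cluster rather than running in one loop.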