Not sure if you've already seen these tutorials, so here they are:

On another note, it seems like you could really benefit from a newly released product: StreamSets Transformer. It has built-in capabilities such as joining datasets from multiple sources/origins and performing complex transformations (aggregations, sorting, ranking, etc.) across the entire dataset. You can also extend it by writing custom Scala and PySpark code. For details, see the StreamSets Transformer documentation.

Cheers, Dash