How can I tell when MapReduce has finished converting Avro files to Parquet?

asked 2018-11-20 07:26:02 -0600

anonymous user


updated 2018-11-20 07:28:08 -0600


My goal is to create a pipeline that does the following:

1) Ingest data from MySQL into a temporary location in HDFS in Avro format

2) Convert the Avro files to Parquet with a MapReduce executor as explained here

3) Validate with a query that the number of ingested rows is correct, and if validation succeeds, then

4) Ingest the data from MySQL into a permanent location in HDFS, again in Parquet

The problem is that in step 2 I don't know when all the files have been converted to Parquet, so I can't tell when it is safe to continue to step 3, since StreamSets can't monitor MapReduce jobs.
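For illustration, the missing wait between steps 2 and 3 could look something like the sketch below, run outside StreamSets. It assumes the YARN ResourceManager REST API is reachable and that the application ID of the conversion job is known; the function names `fetch_app` and `wait_for_app`, the poll interval, and the timeout are all hypothetical choices, not anything StreamSets provides.

```python
import json
import time
import urllib.request

# Terminal application states reported by the YARN ResourceManager REST API.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def fetch_app(rm_url, app_id):
    """Return the 'app' object from GET {rm_url}/ws/v1/cluster/apps/{app_id}."""
    url = f"{rm_url}/ws/v1/cluster/apps/{app_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["app"]


def wait_for_app(get_app, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_app() until the job reaches a terminal state.

    Returns True only if the job finished and its final status is SUCCEEDED.
    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    waited = 0
    while True:
        app = get_app()
        if app["state"] in TERMINAL_STATES:
            return app["state"] == "FINISHED" and app["finalStatus"] == "SUCCEEDED"
        if waited >= timeout_seconds:
            raise TimeoutError("conversion job did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Hypothetical usage (rm_url and app_id must come from your own cluster):
#   ok = wait_for_app(lambda: fetch_app(rm_url, app_id))
#   if ok: run the row-count validation from step 3
```

Passing the fetch step in as a callable keeps the polling loop independent of how the job status is actually obtained, so the same loop would work with a different status source.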

Could you help me solve this problem? Alternatively, is there another way to write data to HDFS in Parquet format?

Thank you!
