
Don't know when MapReduce has finished converting Avro files to Parquet

asked 2018-11-20 07:26:02 -0600

anonymous user


updated 2018-11-20 07:28:08 -0600


My goal is to create a pipeline that does the following:

1) Ingest data from MySQL to a temporary location in HDFS in Avro format

2) Convert the Avro files to Parquet with a MapReduce executor, as explained here

3) Validate with a query that the number of ingested rows is correct, and if the validation passes, then

4) Ingest data from MySQL to a permanent location in HDFS, again in Parquet format
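For step 3, the validation boils down to comparing a source count against a destination count. A minimal sketch, assuming DB-API-style cursors (e.g. from pymysql for MySQL and impyla for Impala) and hypothetical table names — none of these details are part of the original pipeline:

```python
# Hedged sketch of the step-3 validation: compare the source row count in
# MySQL with the row count the query engine sees over the Parquet files.
# The cursor objects, table names, and client libraries are assumptions.
def counts_match(src_cursor, dst_cursor, src_table: str, dst_table: str) -> bool:
    """Return True when the source and destination row counts agree."""
    src_cursor.execute(f"SELECT COUNT(*) FROM {src_table}")
    src_count = src_cursor.fetchone()[0]
    dst_cursor.execute(f"SELECT COUNT(*) FROM {dst_table}")
    dst_count = dst_cursor.fetchone()[0]
    return src_count == dst_count
```

Only when `counts_match` returns True would step 4 (the permanent Parquet ingest) be kicked off.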

The problem is that in step 2 I don't know when all the files have been converted to Parquet, so I can't tell when to continue to step 3, since StreamSets can't monitor MapReduce jobs.

Could you help me fix this problem? Is there an alternative way to write data to HDFS in Parquet format?

Thank you!


1 Answer


answered 2020-02-07 13:43:14 -0600

subie

We had a similar problem. We wanted to automatically execute an Impala query through the Hive Query stage AFTER the Parquet conversion job completed. Your first instinct will be to produce events off of the MapReduce stage... but that stage triggers an event when the job STARTS, not when it finishes. So we ended up setting mapreduce.job.end-notification.url as a MapReduce Configuration. The callback URL can then point to a REST API (microservice pipeline) that ends up triggering the Hive Query stage.
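A minimal sketch of such a callback receiver, assuming the property is set to something like `http://callback-host:8080/job-done?jobId=$jobId&status=$jobStatus` (Hadoop substitutes the `$jobId` and `$jobStatus` placeholders before issuing the HTTP GET; the host, port, and path are hypothetical):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def parse_notification(path):
    """Extract jobId and status from the callback request path."""
    params = parse_qs(urlparse(path).query)
    return params.get("jobId", [""])[0], params.get("status", [""])[0]

class JobEndHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        job_id, status = parse_notification(self.path)
        if status == "SUCCEEDED":
            # Trigger the downstream work here, e.g. start the pipeline
            # containing the Hive Query stage via the StreamSets REST API.
            pass
        self.send_response(200)
        self.end_headers()

# To run the receiver:
# HTTPServer(("", 8080), JobEndHandler).serve_forever()
```

In practice you would replace the `pass` with whatever kicks off your downstream pipeline; in our setup that role was played by the microservice pipeline itself.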

Since you have some additional validation and not just a simple INVALIDATE METADATA or REFRESH statement, you would probably have to add more info to the query string in the notification URL (e.g. the number of records expected). But the first step is probably mapreduce.job.end-notification.url.
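As a hypothetical illustration of that extended query string, suppose the URL also carries an `expected` parameter (e.g. `...?jobId=$jobId&status=$jobStatus&expected=12345` — the parameter name is an assumption, not a Hadoop feature). The callback can then gate step 4 on both job status and row count:

```python
from urllib.parse import urlparse, parse_qs

def should_continue(callback_path: str, actual_count: int) -> bool:
    """Return True when the MapReduce job succeeded and the row counts match.

    callback_path is the request path of the job-end notification, including
    the hypothetical 'expected' parameter added by the submitting pipeline.
    """
    params = parse_qs(urlparse(callback_path).query)
    status = params.get("status", [""])[0]
    expected = int(params.get("expected", ["-1"])[0])
    return status == "SUCCEEDED" and expected == actual_count
```

Here `actual_count` would come from the validation query of step 3; only a match allows the permanent Parquet ingest to proceed.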

