Hive Streaming Support with Parquet File Format

asked 2018-07-19 09:44:44 -0600

ravi

updated 2018-07-20 15:12:58 -0600

metadaddy

We want to write data from Kafka to HDFS in Parquet format, mapped to a Hive table. Hive Streaming is one option for avoiding the small-files problem, but it supports only the ORC file format (this is a Hive Streaming limitation, not a StreamSets one).

Kafka has the HDFS Sink Connector for this. Does StreamSets offer a similar option?

1 Answer

answered 2018-07-20 15:12:10 -0600

metadaddy

Use the Hive Metadata processor with the Hadoop FS and Hive Metastore destinations, as detailed in Drift Synchronization Solution for Hive. The Hadoop FS destination can be tuned to create large files, so small files should not be a problem, and you can trigger a MapReduce job to convert the Avro output to Parquet.
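To illustrate why tuning the destination avoids small files, here is a rough Python sketch of a size-based roll policy. It is a toy model, not StreamSets code: `RollingWriter`, `max_bytes`, and `on_close` are hypothetical names made up for this example. Records accumulate in the current file until a size threshold is reached. The file is then closed and a callback fires, which is roughly the point at which a conversion job could be triggered.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollingWriter:
    """Toy model of a size-based file roll policy (hypothetical, not a StreamSets API)."""
    max_bytes: int                               # roll threshold for the current file
    on_close: Callable[[List[bytes]], None]      # called with the records of each closed file
    _buffer: List[bytes] = field(default_factory=list)
    _size: int = 0

    def write(self, record: bytes) -> None:
        # Append to the current file and roll once the threshold is reached.
        self._buffer.append(record)
        self._size += len(record)
        if self._size >= self.max_bytes:
            self.close()

    def close(self) -> None:
        # Hand the finished file to the callback and start a new, empty one.
        if self._buffer:
            self.on_close(self._buffer)
            self._buffer = []
            self._size = 0


closed_files: List[List[bytes]] = []
writer = RollingWriter(max_bytes=100, on_close=closed_files.append)
for _ in range(25):
    writer.write(b"x" * 10)   # 25 records of 10 bytes each
writer.close()                # flush the partial final file
```

With a 100-byte threshold, the 25 records end up in three files instead of 25. Raising the threshold gives fewer, larger files, at the cost of data taking longer to become visible downstream.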
