Hive Streaming Support with Parquet File Format

asked 2018-07-19 09:44:44 -0600 by ravi

updated 2018-07-20 15:12:58 -0600 by metadaddy

We want to write data from Kafka into HDFS in Parquet file format, mapped to a Hive table. To avoid the small files issue, using Hive Streaming is an option; however, it supports only the ORC file format (this is not a StreamSets limitation, it is a Hive Streaming limitation).

Kafka Connect has an HDFS Sink Connector to achieve this. Is there a similar option in StreamSets?
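
For reference, this is a minimal sketch of the kind of Kafka Connect HDFS Sink Connector configuration the question alludes to, assuming Confluent's HDFS sink connector; the connector name, topic, URLs, and sizes below are hypothetical placeholders, not a tested setup.

    # Hypothetical Kafka Connect HDFS Sink Connector config (Confluent connector assumed);
    # names, URLs and sizes are illustrative placeholders only.
    name=kafka-to-hive-parquet
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    topics=events
    hdfs.url=hdfs://namenode:8020
    # Write Parquet files directly
    format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
    # A larger flush.size means fewer, larger files
    flush.size=100000
    # Register and maintain the table in the Hive metastore
    hive.integration=true
    hive.metastore.uris=thrift://metastore-host:9083
    schema.compatibility=BACKWARD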


1 Answer


answered 2018-07-20 15:12:10 -0600 by metadaddy

updated 2019-01-11 10:24:52 -0600

Use the Hive Metadata processor with the Hadoop FS and Hive Metastore destinations, as detailed in Drift Synchronization Solution for Hive. The Hadoop FS destination can be tuned to create large files, so you should not run into the small files issue, and you can trigger a MapReduce job to convert the Avro output to Parquet.

See the section Timeout to Close Idle Files for details on how to tune the Hadoop FS destination to create larger files. Increasing one or more of 'Idle Timeout', 'Max Records in a File' and 'Max File Size' should result in larger files.
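
To make that concrete, here is a sketch of those settings on the Hadoop FS destination's Output Files tab; the setting names are the ones mentioned above, while the values are hypothetical examples for illustration, not recommended defaults.

    # Hadoop FS destination -> Output Files (illustrative values only)
    Idle Timeout          = ${1 * HOURS}   # keep a file open longer before it is closed as idle
    Max Records in a File = 0              # 0 = no record-count limit, so record count alone won't roll files
    Max File Size (MB)    = 2048           # roll files only once they approach ~2 GB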


Comments

Great! So, metadaddy (cool name), would you tell me more about this "Hadoop FS tuning"? From my perspective, the problem with Hive Streaming is the same as Ravi's: small files in HDFS. Can you point me to a link?

Mathias Brem ( 2019-01-11 06:56:19 -0600 )

Hi Mathias - I added a paragraph explaining how to make the Hadoop FS files bigger.

metadaddy ( 2019-01-11 10:25:21 -0600 )