
Hadoop FS close file every hour?

asked 2019-02-25 10:34:40 -0600

supahcraig

I have a pipeline with a more or less continuous stream of data flowing through it. I currently have the max file size set to 500 MB and the idle timeout set to 5 minutes, but the idle timeout only triggers if the pipeline stops; there is never a 5-minute idle period while it is running.

What I would like is simply to say "close the file after an hour of writing." I realize I could handle this with the directory template, but that is currently configured to create a new directory every month, which matches my partitioning scheme. I don't actually want a directory for every hour of the day; I just want StreamSets to close out the file every hour so I can query the external table for that data (I have an external table with partitions for each month).

Or am I approaching this problem completely incorrectly?


1 Answer


answered 2019-02-25 11:43:38 -0600

metadaddy

updated 2019-02-25 16:05:11 -0600

I think partitioning is the way to go here. The file idle time and max file size parameters address operational concerns, while partitioning will give you exactly what you need, albeit at the expense of a deeper directory structure.

You might be able to use Dataflow Triggers with an executor to automate what you're trying to do. Here are a couple of useful links:



So then my problem becomes adding the partitions to the table for every day + hour, and I'm unsure how to automate that via StreamSets. Also, this problem is n-fold for me because the TOP level of my partitioning is "client name", which could be any of 15+ values. Clearly I need a programmatic solution.

supahcraig (2019-02-25 14:21:37 -0600)

Are you wanting to automate adding data to a Hive table?

metadaddy (2019-02-25 15:19:36 -0600)

I don't think so, but maybe? Currently I'm landing Avro files and then have an external table which looks at those files via partitions I add every month. I'm open to other options; I'm doing it this way because I understand it. If you have another/better way, I'm all ears; teach me. :)

supahcraig (2019-02-25 15:48:27 -0600)

I edited some useful links into my answer. Hope they help point you in the right direction!

metadaddy (2019-02-25 16:05:39 -0600)
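
For the "add a partition for every client + day + hour" problem raised in the comments, here is a rough sketch of one way to generate the Hive statements outside the pipeline. It is not from the thread: the table name `events`, the base directory `/data/events`, the partition columns `client`, `dt`, and `hr`, and the client names are all made-up placeholders, and it assumes the files land in a `client=.../dt=.../hr=...` directory layout. It only builds the `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` strings; running them (beeline, a scheduled job, or an executor stage) is left to whatever you already use.

```python
from datetime import datetime, timedelta


def hourly_partition_statements(table, base_dir, clients, start, hours):
    """Build one Hive ADD PARTITION statement per client per hour.

    All names passed in (table, base_dir, clients) are hypothetical examples.
    """
    statements = []
    for client in clients:
        for offset in range(hours):
            ts = start + timedelta(hours=offset)
            day, hour = ts.strftime("%Y-%m-%d"), ts.strftime("%H")
            # Assumed directory layout: <base>/client=<c>/dt=<day>/hr=<hour>
            location = f"{base_dir}/client={client}/dt={day}/hr={hour}"
            statements.append(
                f"ALTER TABLE {table} ADD IF NOT EXISTS "
                f"PARTITION (client='{client}', dt='{day}', hr='{hour}') "
                f"LOCATION '{location}'"
            )
    return statements


if __name__ == "__main__":
    # Example: two placeholder clients, 24 hourly partitions for one day.
    for stmt in hourly_partition_statements(
        "events", "/data/events", ["acme", "globex"], datetime(2019, 2, 25, 0), 24
    ):
        print(stmt + ";")
```

Because the statements use `IF NOT EXISTS`, re-running the script for an overlapping time window should be harmless, which makes it easy to schedule ahead of the data arriving.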

