Get filenames inside zip archive?

asked 2018-03-22 14:31:18 -0500

cimpicci

I'm trying to move data from a series of zipped .DAT files into Hive. The table name in Hive should be based on the .DAT file's name. Here is an example file structure for one .zip:
--- abc.DAT
--- def.DAT

... etc.

I'm currently using an Expression Evaluator to add some metadata columns, and I'm trying to add the name of the .DAT file (e.g. abc.DAT).

${file:pathElement(record:attribute('file'), -1)} is returning the name of the zip file itself, and not the internal filenames. Is there a way to get the correct string, or would I have to unzip the files before running the pipeline?

answered 2018-03-25 15:02:24 -0500

mstang

updated 2018-03-25 15:03:35 -0500

StreamSets wouldn't be aware of the contents of a zip file; I believe the only way to pass it through would be as a whole file. Hive also would not be able to use a zip file (without using a transform query or some kind of UDF).

You would have to use an external script to read the contents of your zip (unzip -l) and save the listing to HBase, MySQL, etc. so that you could do a lookup in the pipeline.
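If you'd rather not parse the output of unzip -l, the listing step could be sketched in Python with the standard zipfile module. This is only a rough sketch: the function name list_dat_members is made up for illustration, and writing the resulting pairs to your lookup store is left out.

```python
import zipfile
from pathlib import Path


def list_dat_members(zip_path):
    """Return (zip file name, inner file name) pairs for every .DAT entry.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (Path(zip_path).name, info.filename)
            for info in archive.infolist()
            # Skip directory entries and anything that is not a .DAT file.
            if not info.is_dir() and info.filename.upper().endswith(".DAT")
        ]
```

Each pair could then be inserted into whatever table the pipeline's lookup reads from.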

It would make more sense to have the external script extract the DAT files and rename them, adding info such as the export ID from the zip file name. That way you'd have the option of doing useful things in your pipeline, and you'd also be able to query the data natively once you put it in Hive.
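A minimal sketch of that extract-and-rename step, again using Python's zipfile module. The function name extract_and_rename and the "<zip name>_<inner name>" naming scheme are assumptions for illustration; adapt them to however your export ID is encoded in the zip file name.

```python
import shutil
import zipfile
from pathlib import Path


def extract_and_rename(zip_path, out_dir):
    """Extract every .DAT entry, prefixing it with the zip's base name.

    Hypothetical helper: the naming scheme is an assumption, not a convention
    that StreamSets or Hive requires.
    """
    zip_path = Path(zip_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.upper().endswith(".DAT"):
                continue
            # Use only the final path component so nested or odd entry
            # paths cannot place files outside out_dir.
            inner_name = Path(info.filename).name
            target = out_dir / f"{zip_path.stem}_{inner_name}"
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

With the extracted files as the pipeline's input, the file-name expression from the question should then pick up the renamed .DAT file's name instead of the zip's.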

