Hadoop FS destination writes incomplete JSON files

asked 2018-10-26 15:03:38 -0600 by supahcraig

I have a Kafka origin reading a topic that holds JSON messages, and I send those to a Hadoop FS destination writing JSON to HDFS. No matter how I configure the output files (various max record counts, max file sizes, or idle timeouts), the final JSON message in each file is incomplete. Every other message in every file looks complete; it is only the final record in each file, and there doesn't appear to be any pattern to where that last message gets truncated. This manifests as not being able to run aggregations (e.g. count(*)) over the data, which is obviously a huge problem.
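
For reference, here is roughly how I've been spot-checking which files are affected: a minimal Python sketch that tries to parse the last line of each output file. The directory path is hypothetical, the files are assumed to be newline-delimited JSON small enough to cat in full, and it just shells out to the standard `hdfs dfs` CLI, so adjust to your layout.

#!/usr/bin/env python3
# Sketch: flag HDFS output files whose final line is not valid JSON.
# Assumes newline-delimited JSON and a hypothetical output directory;
# uses the standard `hdfs dfs` CLI via subprocess.
import json
import subprocess

HDFS_DIR = "/user/sdc/json_out"  # hypothetical destination directory

def hdfs_files(directory):
    # `-C` prints paths only, one per line
    out = subprocess.run(["hdfs", "dfs", "-ls", "-C", directory],
                         capture_output=True, text=True, check=True)
    return [p for p in out.stdout.splitlines() if p.strip()]

def last_line(path):
    # Cat the whole file and keep the final non-empty line
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, text=True, check=True)
    lines = [l for l in out.stdout.splitlines() if l.strip()]
    return lines[-1] if lines else ""

for path in hdfs_files(HDFS_DIR):
    tail = last_line(path)
    try:
        json.loads(tail)
    except json.JSONDecodeError:
        # Truncated record: the last line does not parse as JSON
        print(f"TRUNCATED: {path}: {tail[:80]!r}")

Reading each whole file just to get the tail is crude, but it's fine for the file sizes these rolls produce.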

If I instead write as Avro, I am able to query the table with no problem.

This is baffling, and it renders the Hadoop FS destination somewhat useless: I can't query the external table I have sitting on top of these files if they contain invalid JSON, and I don't want to write Avro for every single pipeline. Plain JSON is occasionally what I want.


Comments

That's very strange. Which version of SDC are you running? And how big are your JSON messages/objects? Not that it should matter... just curious. Anything unusual in sdc.log?

iamontheinet ( 2018-10-29 15:11:43 -0600 )

On v3.4.2. The JSON messages are not particularly big: 12 fields of strings & numbers, no arrays or other complex structures, and the strings are all less than 100 characters. The fact that no messages have any trouble other than whichever happens to be the last one in a file tells me it's not a data issue.

supahcraig ( 2018-10-29 21:17:27 -0600 )