Hive SocketTimeout causing records to be dropped

asked 2020-02-20 15:24:51 -0500

tclaw46

updated 2020-02-20 18:26:23 -0500

metadaddy

My company recently began using StreamSets with Cloudera and is still working through some issues to get everything in working order. One issue we discovered is that, from time to time, when StreamSets attempts to upload records to Hive, it hits a socket timeout error like the one below:

"errorMessage": "HIVE_23 - TBL Properties 'com.streamsets.pipeline.stage.lib.hive.exceptions.HiveStageCheckedException: HIVE_20 - Error executing SQL: DESCRIBE DATABASE `cbw_fraud`, Reason:org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out' Mismatch: Actual: {} , Expected: {}"

This seems to correct itself, as most of the records are eventually written to Hive. However, when the error first occurs, some of the records in flight are dropped instead of being uploaded.

This appears to be more of an issue with Hive than with StreamSets, and we are trying to figure out the cause. In the meantime, though, we need a workaround to prevent data from being lost. Is there a way for a pipeline to automatically rerun records that fail with specific errors? Failing that, can the error records be written to a file without the error details attached, so that we can gather every failed record into one new file and ingest that?
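A minimal sketch of the second idea, in case it helps make the question concrete. It assumes the failed records end up in a newline-delimited JSON file where each line carries an `errorMessage` (as in the error quoted above) and the original data under a hypothetical `record` key. That layout, the `record` key, and the function names are illustrative assumptions, not the format StreamSets actually writes, so the key names would need adjusting to the real file.

```python
import json
import re

# Assumption: records worth replaying are those that failed on a transient
# connectivity problem, recognised by text seen in the quoted error message.
RETRYABLE = re.compile(r"SocketTimeoutException|TTransportException")


def extract_retryable(lines, payload_key="record"):
    """Yield the original payload of each error entry that looks transient.

    `payload_key` is a hypothetical field name for the original record data.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if RETRYABLE.search(entry.get("errorMessage", "")):
            yield entry[payload_key]


def write_clean_file(src_path, dst_path):
    """Copy retryable payloads from src_path to dst_path, one JSON object per line.

    Returns the number of records written.
    """
    count = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for payload in extract_retryable(src):
            dst.write(json.dumps(payload) + "\n")
            count += 1
    return count
```

The output file would then contain only the original records, with no error details, and could be fed to a separate pipeline for re-ingestion once Hive is reachable again.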

We just upgraded StreamSets to 3.13 so it is up-to-date.


Comments

Does the pipeline stop and restart when this happens, or just carry on running, showing error records? If the latter, you should be able to configure the pipeline's error stream.

metadaddy (2020-02-20 18:27:21 -0500)

The pipeline just carries on running, attempting to connect to Hive with each record until it succeeds. If it doesn't succeed, the record gets dropped. How can I configure the error stream?

tclaw46 (2020-02-21 09:37:12 -0500)

https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Pipeline_Design/ErrorHandling.html#concept_pm4_txm_vq

metadaddy (2020-02-21 13:40:12 -0500)

If you're a StreamSets customer, please open a support ticket. We should not be dropping records, even if there is a connectivity problem.

metadaddy (2020-02-21 13:40:40 -0500)

We did open a ticket. The response so far has been that there isn't much they can do if the issue isn't directly caused by SDC, which is why we have been looking for workarounds until we can fix the connection issue with Hive.

tclaw46 (2020-02-21 15:41:29 -0500)