Solr destination not working correctly

asked 2018-02-15 11:06:10 -0500

anonymous user

updated 2018-02-15 12:24:33 -0500

jeff

CDH 5.9.1, StreamSets 3.0.3.0

I have a Solr destination set up in cloud mode with Kerberos enabled.

JSON data goes to the destination from a Directory origin.

The field mapping is done in the UI.

In batch mode, only part of the batch gets indexed in Solr. For example, for an input batch of 500, I get 174 documents in Solr, and then the pipeline just hangs: no more documents appear in Solr and no more batches are requested from the origin. It just sits there. This partial batch appears very quickly in the Solr Admin screen.

In record mode, I get ONE document in Solr regardless of the input batch size. The number of documents that actually go into Solr is somewhat random and varies with the input batch size.

As I understand it, a Solr batch update is "all or nothing": if any document fails, the whole batch fails. So why am I getting partial batches?
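To pin down exactly how many documents land in Solr after each run, the hit count can be queried directly instead of being read off the Admin screen. The following is a minimal Python sketch, assuming Solr's standard select handler with a JSON response. The base URL and collection name are placeholders, and Kerberos/SPNEGO authentication is not handled here, so on a secured cluster the request would need to go through a client that negotiates it.

```python
import json
import urllib.request
from urllib.parse import urlencode


def count_query_url(base_url, collection):
    """Build a select URL that matches all documents but returns no rows."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


def num_found(response_text):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


def fetch_count(base_url, collection):
    """Query Solr and return the number of indexed documents.

    Assumption: the endpoint is reachable without extra authentication.
    """
    with urllib.request.urlopen(count_query_url(base_url, collection)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

Running `fetch_count` before and after a pipeline run (with a hypothetical base URL such as `http://solr-host:8983/solr` and the target collection name) gives the number of documents the run actually added, provided a commit has happened in between.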

There are no errors in the StreamSets logs, no errors in the Solr logs, and no error records being written by the pipeline.

2 Answers


answered 2018-02-15 12:57:34 -0500

badcat914

updated 2018-02-15 13:13:34 -0500

jeff

Thanks Jeff, this thing is driving me crazy.

I enabled Kerberos through Cloudera Manager, then while experimenting I changed sdc.properties as per the documentation for enabling Kerberos (though I don't think I should have had to). In sdc.properties I have this:

kerberos.client.enabled=true
kerberos.client.keytab=streamsets.keytab
kerberos.client.principal=sdc/???-vba...

(The ??? are for privacy; it's a valid principal.)

This test runs fine with the Directory origin and a Trash Destination, no errors, all records processed. I can reset the origin.

The only thing I'll note is that the records are BIG; I had to set the size on the Directory origin's Data Format tab to 64k to get them through.

I don't have enough points here to attach the Export JSON, so sorry for the BIG paste...

{
  "pipelineConfig" : {
    "schemaVersion" : 5,
    "version" : 7,
    "pipelineId" : "JSONTest132955077-954a-430f-8309-e8fb76b36817",
    "title" : "JSONTest1",
    "description" : "",
    "uuid" : "35618ed1-85a0-4db9-9d64-0a0903b743c2",
    "configuration" : [ {
      "name" : "executionMode",
      "value" : "STANDALONE"
    }, {
      "name" : "deliveryGuarantee",
      "value" : "AT_LEAST_ONCE"
    }, {
      "name" : "startEventStage",
      "value" : "streamsets-datacollector-basic-lib::com_streamsets_pipeline_stage_destination_devnull_ToErrorNullDTarget::1"
    }, {
      "name" : "stopEventStage",
      "value" : "streamsets-datacollector-basic-lib::com_streamsets_pipeline_stage_destination_devnull_ToErrorNullDTarget::1"
    }, {
      "name" : "shouldRetry",
      "value" : true
    }, {
      "name" : "retryAttempts",
      "value" : -1
    }, {
      "name" : "memoryLimit",
      "value" : "${jvm:maxMemoryMB() * 0.85}"
    }, {
      "name" : "memoryLimitExceeded",
      "value" : "LOG"
    }, {
      "name" : "notifyOnStates",
      "value" : [ "RUN_ERROR", "STOPPED", "FINISHED" ]
    }, {
      "name" : "emailIDs",
      "value" : [ ]
    }, {
      "name" : "constants",
      "value" : [ ]
    }, {
      "name" : "badRecordsHandling",
      "value" : "streamsets-datacollector-basic-lib::com_streamsets_pipeline_stage_destination_recordstolocalfilesystem_ToErrorLocalFSDTarget::1"
    }, {
      "name" : "errorRecordPolicy",
      "value" : "ORIGINAL_RECORD"
    }, {
      "name" : "workerCount",
      "value" : 0
    }, {
      "name" : "clusterSlaveMemory",
      "value" : 2048
    }, {
      "name" : "clusterSlaveJavaOpts",
      "value" : "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Dlog4j.debug"
    }, {
      "name" : "clusterLauncherEnv",
      "value" : [ ]
    }, {
      "name" : "mesosDispatcherURL",
      "value" : null
    }, {
      "name" : "hdfsS3ConfDir",
      "value" : null
    }, {
      "name" : "rateLimit",
      "value" : 0
    }, {
      "name" : "maxRunners",
      "value" : 0
    }, {
      "name" : "shouldCreateFailureSnapshot",
      "value" : true
    }, {
      "name" : "webhookConfigs",
      "value" : [ ]
    }, {
      "name" : "sparkConfigs",
      "value" : [ ]
    }, {
      "name" : "statsAggregatorStage",
      "value" : ""
    } ],
    "uiInfo" : {
      "previewConfig" : {
        "previewSource" : "CONFIGURED_SOURCE",
        "batchSize" : "10",
        "timeout" : "30000",
        "writeToDestinations" : true,
        "executeLifecycleEvents" : false,
        "showHeader" : true,
        "showFieldType" : true,
        "rememberMe" : false
      }
    },
    "stages" : [ {
      "instanceName" : "Directory_01",
      "library" : "streamsets-datacollector-basic-lib",
      "stageName" : "com_streamsets_pipeline_stage_origin_spooldir_SpoolDirDSource",
      "stageVersion" : "9",
      "configuration" : [ {
        "name" : "conf.dataFormatConfig.compression",
        "value" : "NONE"
      }, {
        "name" : "conf.dataFormatConfig.filePatternInArchive",
        "value" : "*"
      }, {
        "name" : "conf.dataFormatConfig.charset",
        "value" : "UTF-8"
      }, {
        "name" : "conf.dataFormatConfig.removeCtrlChars",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.textMaxLineLen",
        "value" : 1024
      }, {
        "name" : "conf.dataFormatConfig.useCustomDelimiter",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.customDelimiter",
        "value" : "\\r\\n"
      }, {
        "name" : "conf.dataFormatConfig.includeCustomDelimiterInTheText",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.jsonContent",
        "value" : "MULTIPLE_OBJECTS"
      }, {
        "name" : "conf.dataFormatConfig.jsonMaxObjectLen",
        "value" : 65556
      }, {
        "name" : "conf.dataFormatConfig.csvFileFormat",
        "value" : "CSV"
      }, {
        "name" : "conf.dataFormatConfig.csvHeader",
        "value" : "NO_HEADER"
      }, {
        "name" : "conf.dataFormatConfig.csvAllowExtraColumns",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.csvExtraColumnPrefix",
        "value" : "_extra_"
      }, {
        "name" : "conf.dataFormatConfig.csvMaxObjectLen",
        "value" : 1024
      }, {
        "name" : "conf.dataFormatConfig.csvCustomDelimiter",
        "value" : "|"
      }, {
        "name" : "conf.dataFormatConfig.csvCustomEscape",
        "value" : "\\"
      }, {
        "name" : "conf.dataFormatConfig.csvCustomQuote",
        "value" : "\""
      }, {
        "name" : "conf.dataFormatConfig.csvEnableComments",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.csvCommentMarker",
        "value" : "#"
      }, {
        "name" : "conf.dataFormatConfig.csvIgnoreEmptyLines",
        "value" : true
      }, {
        "name" : "conf.dataFormatConfig.csvRecordType",
        "value" : "LIST_MAP"
      }, {
        "name" : "conf.dataFormatConfig.csvSkipStartLines",
        "value" : 0
      }, {
        "name" : "conf.dataFormatConfig.parseNull",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.nullConstant",
        "value" : "\\\\N"
      }, {
        "name" : "conf.dataFormatConfig.xmlRecordElement",
        "value" : null
      }, {
        "name" : "conf.dataFormatConfig.includeFieldXpathAttributes",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.xPathNamespaceContext",
        "value" : [ ]
      }, {
        "name" : "conf.dataFormatConfig.outputFieldAttributes",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.xmlMaxObjectLen",
        "value" : 4096
      }, {
        "name" : "conf.dataFormatConfig.logMode",
        "value" : "COMMON_LOG_FORMAT"
      }, {
        "name" : "conf.dataFormatConfig.logMaxObjectLen",
        "value" : 1024
      }, {
        "name" : "conf.dataFormatConfig.retainOriginalLine",
        "value" : false
      }, {
        "name" : "conf.dataFormatConfig.customLogFormat",
        "value" : "%h %l %u %t \"%r\" %>s %b"
      }, {
        "name" : "conf.dataFormatConfig.regex",
        "value" : "^(\\S+) (\\S+) (\\S ...

Comments

Dumb question, but did you enable the Kerberos checkbox on the Solr destination itself?

jeff (2018-02-15 13:12:39 -0500)

Yes, I did. It doesn't connect if I disable it.

badcat914 (2018-02-15 13:30:51 -0500)

Can you attach a screenshot showing the pipeline metrics? I'm specifically interested in the last batch size as it enters the Solr destination, and in a screenshot of the pie chart showing batch processing time.

jeff (2018-02-15 14:32:04 -0500)

I'm also curious whether you have a sense of how long Solr even takes to index records like this. Can you save a batch of records to a file, then run a Solr index request outside of SDC (e.g., using curl) to get a good baseline for reasonable performance?

jeff (2018-02-15 14:52:42 -0500)
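A baseline like the one suggested above could also be scripted rather than run by hand with curl. This is a rough Python sketch, assuming the collection accepts a JSON array of documents on its update handler with `commit=true`. The base URL, collection name, and document fields are placeholders, and Kerberos/SPNEGO negotiation is deliberately left out, so it only works as written against an endpoint that does not require it.

```python
import json
import time
import urllib.request


def build_update_request(base_url, collection, docs):
    """Build a POST request that sends a list of documents to Solr's update handler."""
    url = f"{base_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_index(request, opener=urllib.request.urlopen):
    """Send the request and return (HTTP status, elapsed seconds).

    `opener` is injectable so an authenticated opener can be swapped in.
    """
    start = time.perf_counter()
    with opener(request) as resp:
        status = resp.status
        resp.read()  # drain the response so the timing covers the full round trip
    return status, time.perf_counter() - start
```

Timing one request with the same 500 records the pipeline reads gives a number to compare against the batch processing time SDC reports for the Solr destination.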

Jeff, I can't post screenshots (I have insufficient karma), but it's showing the batch output from the origin as 500 (which is what I have it set to), and the pie chart shows about 99% of the time in the origin. We have a standalone Solr indexer that processes these records, and it's very fast.

badcat914 (2018-02-15 15:12:13 -0500)

answered 2018-02-15 12:23:53 -0500

jeff

updated 2018-02-15 12:26:31 -0500

You mentioned in the mail thread that you made some changes to sdc.properties. What changes, specifically?

As a sanity check, can you duplicate the pipeline, reset origin, and use a Trash destination instead? I want to see if the parser in the origin is correctly handling all the input lines as expected. Also, please paste the full pipeline configuration for the origin and destination (or export the pipeline JSON and share that file).
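For an independent record count to compare against what the Trash pipeline reports, the input file can be parsed outside SDC. Below is a small Python sketch, assuming the file holds multiple JSON objects back to back (separated only by whitespace) and that object length is measured in characters, which may not match exactly how SDC applies its max object length setting. The function names are illustrative only.

```python
import json


def iter_json_objects(text):
    """Yield (object, length_in_chars) for each JSON value in concatenated text."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, next_pos = decoder.raw_decode(text, pos)
        yield obj, next_pos - pos
        pos = next_pos


def summarize(text, max_len):
    """Count the objects and report how many exceed max_len characters."""
    lengths = [n for _, n in iter_json_objects(text)]
    return {
        "count": len(lengths),
        "longest": max(lengths, default=0),
        "over_limit": sum(1 for n in lengths if n > max_len),
    }
```

Calling `summarize` on the contents of one input file, with `max_len` set to the max object length configured on the origin, shows how many records the file really contains and whether any are larger than the configured limit.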
