
Handling big files

asked 2018-03-02 07:31:21 -0500 by mikygit

I spoke too soon and it actually still does not work.

I'm getting the following error in a simple Directory -> Trash workflow.

XML_PARSER_01 - Cannot obtain current reader position: com.streamsets.pipeline.lib.parser.DataParserException:
XML_PARSER_01 - Cannot obtain current reader position: ParseError at
[row,col]:[75393,26] Message: Reader exceeded the read limit '1048576'

I'm using a delimiter element to chunk the input file, but it does not fix the problem.

Workflow JSON and XML file available here:





If you do a preview, is it processing ANY records? If so, then the delimiter is working properly and you simply have an element greater than your max record size, starting at the position in the file indicated by the log line.

jeff (2018-03-02 10:17:44 -0500)

Good point. I was focused on the overall size of the XML file but the problem comes from an XML element which is 62395 lines long!!! Is there a way to tackle that problem other than changing the XML source file?

mikygit (2018-03-02 10:26:16 -0500)

You can keep increasing the max record size to a sufficient value to accommodate even that element?

jeff (2018-03-02 10:53:51 -0500)

I can't increase it above 2147483647 (Java's Integer.MAX_VALUE), which unfortunately is not enough.

mikygit (2018-03-02 11:01:41 -0500)

As a workaround, would it be possible to ignore such a huge element instead of crashing the parser?

mikygit (2018-03-02 11:03:59 -0500)
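One way to do that is to preprocess the file before Data Collector ever reads it, stripping any element that would blow the limit. Below is a minimal sketch using Java's built-in StAX API; the class name, file paths, and size limit are placeholders, and it assumes each record is a direct child of the root element. It counts characters rather than bytes, which is close enough for a cutoff like this.

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.Reader;
    import java.io.StringWriter;
    import java.io.Writer;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.events.XMLEvent;

    // Copies an XML document, dropping any direct child of the root element
    // whose serialized form exceeds MAX_CHARS. This runs outside StreamSets
    // as a one-off preprocessing step; the limit is a placeholder.
    public class SkipHugeElements {
        static final int MAX_CHARS = 1_048_576; // mirrors the read limit in the error

        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            try (Reader in = new FileReader(args[0]);
                 Writer out = new FileWriter(args[1])) {
                XMLEventReader events = factory.createXMLEventReader(in);
                int depth = 0;           // 1 = inside root, 2+ = inside a record
                StringWriter buf = null; // non-null while buffering one record
                while (events.hasNext()) {
                    XMLEvent e = events.nextEvent();
                    if (e.isStartElement() && ++depth == 2) {
                        buf = new StringWriter(); // new record: start buffering
                    }
                    // Write into the record buffer if one is open, else straight out.
                    e.writeAsEncodedUnicode(buf != null ? buf : out);
                    if (e.isEndElement() && --depth == 1 && buf != null) {
                        // Record closed: keep it only if it fits under the limit.
                        if (buf.getBuffer().length() <= MAX_CHARS) out.write(buf.toString());
                        buf = null;
                    }
                }
            }
        }
    }

Run it once per file, e.g. java SkipHugeElements big.xml trimmed.xml, and point the Directory origin at the trimmed file.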

1 Answer


answered 2018-03-05 03:12:34 -0500 by mikygit

updated 2018-03-19 11:18:13 -0500

I'm still looking for a solution to this problem.

For those who would like to see the problem in action, simply create an HTTP Client origin with XML parsing on the following URL:

It should crash with an 'XML object exceeded maximum length' error.

[Edit] I finally managed to make it work by properly setting the DataFactoryBuilder.OverRunLimit parameter, as follows:

    docker rm -f streamsets-dc-2; docker run -p 18630:18630 -v XXX -e SDC_JAVA_OPTS="-Dhttp.proxyHost=$PROXY_HOST -Dhttp.proxyPort=$PROXY_PORT -DDataFactoryBuilder.OverRunLimit=20485760" -e STREAMSETS_LIBRARIES_EXTRA_DIR=XXX -d --name ...
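If you run Data Collector outside Docker, the same system property can presumably be appended to SDC_JAVA_OPTS in libexec/sdc-env.sh instead. Note that 20485760 (about 20 MB) is simply a value large enough for the oversized element here, not a recommended default.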

