Remove specific XML section?

asked 2018-02-25 21:22:23 -0500

rthebest

updated 2018-02-25 23:01:59 -0500

metadaddy

Hello, I am trying to bring XML files into an SDC pipeline. When I set the Directory origin's Data Format to XML, I get this error: SPOOLDIR_01 - Failed to process file '....xml' at position '0': com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_02 - XML object exceeded maximum length: readerId '...xml', offset '0', maximum length '506384'

I've realized this occurs because the XML document has one element that contains a lot of extra text that we don't need. Is there a way to remove a repeating XML section? I would want to remove all <XMLElementToRemove> sections in the document:

<Top>
    <SecondLevel>
        <ThirdLevel>
            <XMLElementToRemove>lots of text</XMLElementToRemove>
        </ThirdLevel>
        <ThirdLevel>
            <XMLElementToRemove>more lengthy text</XMLElementToRemove>
        </ThirdLevel>
    </SecondLevel>
</Top>

I can't even get it to parse the XML to begin with, so I'm not sure how to manipulate it to remove those sections, unless there is a way to do it on the raw text before it gets parsed as XML. Thanks!

1 Answer

answered 2018-02-25 23:17:05 -0500

metadaddy

I don't know of a clean way to remove this before the file is parsed as XML. The best course of action is probably to increase Max Record Length and then remove <XMLElementToRemove> with the Field Remover processor.

If that is not possible, and the XML files are laid out with one element per line, as in your question, it may be possible to use a separate pipeline to read the files line by line as text and filter out any lines containing <XMLElementToRemove>.
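
To illustrate the line-filtering idea, here is a minimal sketch in Python. It is a standalone pre-processing script that would run outside SDC, not a StreamSets feature, and the function names and file paths are hypothetical. It assumes each <XMLElementToRemove> element sits entirely on its own line, as in the sample from the question; if an element spans several lines, this approach would leave broken XML behind.

```python
from typing import Iterable, Iterator

# Marker taken from the sample in the question; adjust to the real element name.
TAG = "<XMLElementToRemove>"


def drop_lines_containing(lines: Iterable[str], marker: str = TAG) -> Iterator[str]:
    """Yield only the lines that do not contain the marker string."""
    for line in lines:
        if marker not in line:
            yield line


def filter_file(src_path: str, dst_path: str, marker: str = TAG) -> None:
    """Copy src_path to dst_path, skipping every line that contains the marker.

    The file is streamed line by line, so memory use is bounded by the
    longest single line rather than by the size of the whole document.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_lines_containing(src, marker))
```

Calling something like filter_file("input.xml", "filtered.xml") (both paths are placeholders) would write a copy without the unwanted lines, which the Directory origin could then pick up from a different directory.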

Comments

Thanks a ton for making my post more readable and answering so promptly! I've set Max Record Length to the highest value it will accept, 2147483647, but unfortunately that's still not enough for this XML document. What's the general approach for reading files line by line and filtering out lines?

rthebest (2018-02-25 23:36:22 -0500)