Hello, I am trying to bring XML files into an SDC pipeline. However, when I set the Directory origin's Data Format to XML, I get this error: SPOOLDIR_01 - Failed to process file '....xml' at position '0': com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_02 - XML object exceeded maximum length: readerId '...xml', offset '0', maximum length '506384'

I've realized this occurs because the XML document contains one element that holds a lot of extra text that we don't need. Is there a way to remove a repeating XML section? I want to remove all <XMLElementToRemove> sections in the document:

            <XMLElementToRemove>lots of text</XMLElementToRemove>
            <XMLElementToRemove>more lengthy text</XMLElementToRemove>

I can't even get it to parse the XML to begin with, though, so I'm not sure how to manipulate it to remove those sections, unless there is a way to do it on the raw text before it gets parsed as XML. Thanks!

I don't know of a clean way to remove this before it gets parsed as XML. The best course of action is probably to increase Max Record Length on the Directory origin and then remove <XMLElementToRemove> with the Field Remover processor.

If that is not possible, and the XML files are laid out one element per line, as in your question, it may be possible to use a separate pipeline to read in the files line-by-line as text and filter out any lines containing <XMLElementToRemove>.
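To illustrate the line-by-line idea, here is a minimal sketch written as a standalone Python pre-processing script rather than as an SDC pipeline. It assumes each <XMLElementToRemove> element opens and closes on a single line, as in the question, and that the files are UTF-8. The function name filter_lines and the paths passed to it are hypothetical, not part of SDC.

```python
TAG = "<XMLElementToRemove>"  # element name taken from the question


def filter_lines(src_path, dst_path, marker=TAG):
    """Copy src_path to dst_path, skipping every line that contains marker.

    Assumes each unwanted element sits entirely on one line.
    Returns the number of lines that were dropped.
    """
    removed = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # reads one line at a time, not the whole file
            if marker in line:
                removed += 1
                continue
            dst.write(line)
    return removed
```

You could run something like this over each file before it lands in the directory the origin watches, for example filter_lines("raw/input.xml", "spool/input.xml") with paths of your own. If an element spans several lines, this simple substring check will not be enough and the remaining lines of that element would be left behind.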

Thanks a ton for making my post more readable and answering so promptly! I've set Max Record Length to the highest value it will accept, 2147483647, but unfortunately that's still not enough for this XML document. What's the general approach for reading files line by line and filtering out lines?

