Unable to parse XML which has 'ampersand' in the data

asked 2019-10-07 06:08:45 -0500

Kumar

updated 2019-10-09 10:40:11 -0500

metadaddy

I am trying to parse an XML file using the "XML parser" processor. It works fine in most cases, but in one case the data in my file looks like <tag>&</tag> and <tag>&abc</tag>.

In these cases the XML parser is unable to parse the XML file. Could anyone please help me out?

Thanks in advance.

Please paste the full stack trace you are seeing. Or at least provide more details about what is happening.

jeff ( 2019-10-07 13:14:12 -0500 )

XMLP_01 - Cannot XML parse the field '/text' for record '78378217 (1).xml::0': com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_03 - Can't parse XML: ParseError at [row,col]:[1,2081] Message: XML document structures must start and end within the

Kumar ( 2019-10-09 06:58:44 -0500 )

Can you capture this particular record, either by sending error records to a file or by finding it in the input? Then run it through a tool like xmllint. It appears to be malformed.

jeff ( 2019-10-09 08:45:53 -0500 )

1 Answer

answered 2019-10-09 10:38:09 -0500

metadaddy

The simple answer is that <tag>&</tag> and <tag>&abc</tag> are not legal XML. In XML, the & (ampersand) character denotes an entity reference, and, according to section 2.4 of the XML standard:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

You will need to fix the producing application to emit &amp; instead of a raw &, or pre-process the data to do the conversion.
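If you go the pre-processing route, the conversion could look roughly like the Python sketch below. This is only an illustration under some assumptions: the helper name escape_bare_ampersands and the sample input are made up for this example, the data is assumed to use only the five predefined XML entities plus numeric character references (no entities declared in a DTD), and it is assumed to contain no CDATA sections or comments where a raw & would be legal. It is not part of the XML Parser processor.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT begin one of the five predefined entities
# (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference
# (&#38; or &#x26;). Assumption: no DTD-declared entities are in use.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text: str) -> str:
    """Replace every bare '&' with '&amp;' (hypothetical helper)."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


# Sample input modeled on the question; the third element is already escaped.
raw = "<root><tag>&</tag><tag>&abc</tag><tag>a &amp; b</tag></root>"

fixed = escape_bare_ampersands(raw)

# The repaired text is now well-formed and can be parsed.
root = ET.fromstring(fixed)
values = [t.text for t in root.findall("tag")]
```

Already escaped references such as &amp; are left alone, so running the function twice should not double-escape anything. If your data does contain CDATA sections, comments, or custom entities, a plain regex like this would change them too, and fixing the producing application is the safer option.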

