'Content is not allowed in prolog' error parsing XML

asked 2017-11-06 16:40:29 -0600

danf

updated 2017-11-07 11:58:22 -0600

metadaddy

I'm new to StreamSets, so I apologize in advance if I'm doing something goofy. All I want to do is parse an XML file with the following format:

<?xml version="1.0" encoding="utf-8"?>
<ordata>
  <row Id="2" Id2="1" Count="7" ... />
.
.
.
</ordata>

I've tried multiple combinations of the directory reader with the XML data format, including the XPath /ordata/row/ and row as the record delimiter, as well as no record delimiter at all. I'm wondering whether it's because all the fields are attributes, or because there's no explicit end tag. In preview, all I get back is:

Event Record1 (new-file): {MAP}
  filepath: {STRING} "/STREAMSETS/so/source/Data.xml"

The sdc log file contains the following error:

2017-11-06 17:29:32,251 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Processing lifecycle start event with stage
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] ERROR SpoolDirSource - Failed to process file '/STREAMSETS/SO/source/Data.xml' at position '-1': com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:652)
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:510)
        at com.streamsets.pipeline.configurablestage.DSource.produce(DSource.java:38)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:228)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:222)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:180)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:249)
        at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:231)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runPollSource(PreviewPipelineRunner.java:315)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.run(PreviewPipelineRunner.java:214)
        at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:510)
        at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:51)
        at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:206)
        at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:249)
        at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:245)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.createParser(XmlDataParserFactory.java:80)
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.getParser(XmlDataParserFactory.java:60)
        at com ...

1 Answer


answered 2017-11-07 12:02:12 -0600

metadaddy

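Your file has a UTF-8 byte order mark (BOM) at the beginning. A BOM is not recommended for UTF-8 data, and the Java parsers don't handle it: the BOM sits in front of the XML declaration, which is what produces the 'Content is not allowed in prolog' error.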
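
If you want to confirm the diagnosis before changing anything, you can look at the first three bytes of the file: a UTF-8 BOM is the byte sequence EF BB BF. Here is a minimal Python sketch of that check. It is a standalone script, not part of Data Collector, and the helper name `has_utf8_bom` is made up for illustration.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    # Open in binary mode so no decoding happens before we inspect the bytes.
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

For the file in the question, that would be `has_utf8_bom("/STREAMSETS/SO/source/Data.xml")`.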

You can work around it by creating a pipeline to preprocess the XML, removing the BOM. Here's one I just created:

[Image: screenshot of the preprocessing pipeline]

Note that I don't even need a regular expression to remove the BOM: we know the offending line is at offset zero, and we know exactly what the first line of the file should be, so the expression simply emits that line verbatim. Here's the expression, so it's easy to copy/paste:

${(record:attribute('offset') == 0) ? '<?xml version="1.0" encoding="utf-8"?>' : record:value('/text')}
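
If you would rather clean the file before Data Collector ever sees it, the same fix can be done outside the pipeline. Here is a minimal Python sketch, assuming the file fits comfortably in memory. `strip_utf8_bom` is a hypothetical helper, not a StreamSets API. It works on raw bytes, so everything after the BOM is copied through unchanged.

```python
import codecs


def strip_utf8_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present.

    Returns True if a BOM was removed, False if the file had none.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Drop exactly the three BOM bytes; leave the rest untouched.
        data = data[len(codecs.BOM_UTF8):]
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return had_bom
```

Unlike the expression above, this doesn't rewrite the first line, so it also works when the XML declaration differs from file to file.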