This should be easy - simple XML parsing

New to StreamSets, so I apologize in advance if I am doing something goofy. All I want to do is parse an XML file with the following format:

<?xml version="1.0" encoding="utf-8"?>
<ordata>
  <row Id="2" Id2="1" Count="7" ... />
.
.
.
</ordata>

I've tried multiple combinations of the Directory origin with the XML data format, including the XPath /ordata/row/ and plain row as the record delimiter, and no record delimiter at all. I'm wondering if it's because all the fields are attributes, or because there's no explicit end tag. In preview all I get back is:

Event Record1 (new-file): {MAP}
  filepath: {STRING} "/STREAMSETS/so/source/Data.xml"

The sdc log file contains the following error:

2017-11-06 17:29:32,251 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Processing lifecycle start event with stage
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] ERROR SpoolDirSource - Failed to process file '/STREAMSETS/SO/source/Data.xml' at position '-1': com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:652)
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:510)
        at com.streamsets.pipeline.configurablestage.DSource.produce(DSource.java:38)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:228)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:222)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:180)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:249)
        at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:231)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runPollSource(PreviewPipelineRunner.java:315)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.run(PreviewPipelineRunner.java:214)
        at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:510)
        at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:51)
        at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:206)
        at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:249)
        at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:245)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.createParser(XmlDataParserFactory.java:80)
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.getParser(XmlDataParserFactory.java:60)
        at com.streamsets.pipeline.lib.parser.WrapperDataParserFactory.getParser(WrapperDataParserFactory.java:65)
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:585)
        ... 22 more
Caused by: java.io.IOException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
        at com.streamsets.pipeline.lib.parser.xml.XmlCharDataParser.<init>(XmlCharDataParser.java:89)
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.createParser(XmlDataParserFactory.java:77)
        ... 25 more
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.peek(XMLEventReaderImpl.java:276)
        at javax.xml.stream.util.EventReaderDelegate.peek(EventReaderDelegate.java:104)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.skipIgnorable(StreamingXmlParser.java:232)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.hasNext(StreamingXmlParser.java:238)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.<init>(StreamingXmlParser.java:113)
        at com.streamsets.pipeline.lib.xml.OverrunStreamingXmlParser.<init>(OverrunStreamingXmlParser.java:59)
        at com.streamsets.pipeline.lib.parser.xml.XmlCharDataParser.<init>(XmlCharDataParser.java:80)
        ... 26 more
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] ERROR DirectorySpooler - Leaving file in error '/STREAMSETS/SO/source/Data.xml' in spool directory
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Destroying pipeline with reason=UNKNOWN
2017-11-06 17:29:32,255 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Processing lifecycle stop event
2017-11-06 17:29:32,255 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Pipeline finished destroying with final reason=FAILURE
2017-11-06 17:29:33,444 [user:admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:webserver-127] WARN  StandaloneAndClusterPipelineManager - Evicting idle previewer 'SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e::0'::'47ee8166-ccaf-4e87-b576-e030695edc91' in status 'FINISHED'
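
One thing worth ruling out from the trace above: "Content is not allowed in prolog" at [row,col]:[1,1] is commonly reported when something sits in front of the first `<` of the file, for example a byte-order mark, or when the file's encoding does not match the charset the reader is configured with. The Python sketch below is only a way to check that outside of Data Collector. The helper name `describe_prolog` and the inline sample are hypothetical, and the sample drops the elided attributes so that it is well-formed. It also shows that self-closing, attribute-only `row` elements parse without trouble in a standard XML parser, so that part of the format is probably not the cause.

```python
import codecs
import xml.etree.ElementTree as ET

# Byte-order marks, with UTF-32 first so it is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE BOM"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE BOM"),
    (codecs.BOM_UTF8, "UTF-8 BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
]


def describe_prolog(data: bytes) -> str:
    """Report what sits in front of the first '<' (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if data.startswith(b"<"):
        return "clean"
    idx = data.find(b"<")
    prefix = data if idx == -1 else data[:idx]
    # Show at most 16 bytes of whatever precedes the markup.
    return "unexpected leading bytes: %r" % prefix[:16]


# Stand-in for Data.xml; the real file's elided attributes are left out.
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<ordata>\n"
    b'  <row Id="2" Id2="1" Count="7" />\n'
    b"</ordata>\n"
)

# Self-closing rows whose fields are all attributes parse as expected.
rows = [row.attrib for row in ET.fromstring(SAMPLE).iter("row")]
print(rows)

print(describe_prolog(SAMPLE))
print(describe_prolog(codecs.BOM_UTF8 + SAMPLE))
```

To check the real file, the bytes could be read with `open(path, "rb").read(16)` and passed to `describe_prolog`. A UTF-16 file without a BOM would still be reported as clean by this check, so the first bytes are worth eyeballing as well.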