Can't parse XML element names containing colon ':'

asked 2018-02-27 09:40:26 -0500

We have an unusual XML format in which every element name contains a ":", as in the example below:

<sh:root>
  <sh:book> </sh:book>
  <sh:genre> </sh:genre>
  <sh:id> </sh:id>
  <sh:book> </sh:book>
  <sh:genre> </sh:genre>
  <sh:id> </sh:id>
  <sh:book> </sh:book>
  <sh:genre> </sh:genre>
  <sh:id> </sh:id>
</sh:root>

We receive approximately 4000 files per day, each between 8 KB and 500 KB. I'm working with a single file at the moment in my source file filter. From what I've tried so far, I cannot get the file to read correctly. If I edit the XML file and replace the ":" in the element names with "_", StreamSets reads the file's elements.

  1. What is the best way to overcome this issue?
  2. What is the best way to convert these files from XML to CSV?

1 Answer

answered 2018-02-28 15:34:17 -0500

metadaddy

An XML qualified name has the form

prefix:localPart

where prefix is a namespace prefix and localPart is the actual element name. One of the 'rules' of XML is that:

the namespace prefix, unless it is xml or xmlns, must have been declared in a namespace declaration attribute in either the start-tag of the element where the prefix is used or in an ancestor element (i.e. an element in whose content the prefixed markup occurs).

(See Namespace Constraint: Prefix Declared)

The XML Parser has no trouble parsing this well-formed XML, which is the same as yours except for the namespace declaration in the sh:root element:

<sh:root xmlns:sh="urn:dummy">
  <sh:book>a</sh:book>
  <sh:genre>b</sh:genre>
  <sh:id>c</sh:id>
  <sh:book>d</sh:book>
  <sh:genre>e</sh:genre>
  <sh:id>f</sh:id>
  <sh:book>g</sh:book>
  <sh:genre>h</sh:genre>
  <sh:id>i</sh:id>
</sh:root>
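
The same behaviour can be reproduced outside StreamSets with any namespace-aware parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not the StreamSets XML Parser itself; the helper name try_parse and the shortened sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same structure as the files in the question, shortened to one child.
UNDECLARED = "<sh:root><sh:book>a</sh:book></sh:root>"
# Identical, except that the sh prefix is bound to a (dummy) namespace URI.
DECLARED = '<sh:root xmlns:sh="urn:dummy"><sh:book>a</sh:book></sh:root>'


def try_parse(text):
    """Return the root tag on success, or the parser's error message."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError as err:
        return f"ParseError: {err}"


undeclared_result = try_parse(UNDECLARED)  # fails: the sh prefix is not bound
declared_result = try_parse(DECLARED)      # succeeds: "{urn:dummy}root"

print(undeclared_result)
print(declared_result)
```

ElementTree reports the parsed tag in the form {namespace-uri}localPart, which shows that the prefix is only shorthand for the declared URI.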

Your files, then, are not actually legal XML as far as a namespace-aware parser is concerned. To be able to parse them, you would need to either remove the sh: namespace prefixes or add the namespace declaration to the root element of each file.
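
If the files cannot be fixed at the source, one option is to add the declaration in a preprocessing step before they reach the pipeline. The sketch below is plain Python, not a StreamSets stage, and rests on several assumptions: the root element itself carries the sh: prefix, urn:dummy is an acceptable placeholder URI, and each record is a book, genre, id run in that order. The function names declare_prefix and to_csv are hypothetical.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

NS_URI = "urn:dummy"  # placeholder namespace URI (an assumption)


def declare_prefix(xml_text, prefix="sh", uri=NS_URI):
    """Add xmlns:<prefix> to the first start tag that uses the prefix.

    Assumes that first tag is the root element, so the declaration
    covers the whole document.
    """
    if f"xmlns:{prefix}=" in xml_text:
        return xml_text  # already declared; leave the text untouched
    pattern = rf"<{re.escape(prefix)}:[\w.-]+"
    return re.sub(
        pattern,
        lambda m: f'{m.group(0)} xmlns:{prefix}="{uri}"',
        xml_text,
        count=1,
    )


def to_csv(xml_text, fields=("book", "genre", "id")):
    """Parse the repaired XML and emit one CSV row per record."""
    root = ET.fromstring(declare_prefix(xml_text))
    rows, row = [], {}
    for child in root:
        name = child.tag.split("}", 1)[-1]  # drop the "{uri}" part
        if name == fields[0] and row:       # first field starts a new record
            rows.append(row)
            row = {}
        row[name] = (child.text or "").strip()
    if row:
        rows.append(row)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = (
    "<sh:root>"
    "<sh:book>a</sh:book><sh:genre>b</sh:genre><sh:id>c</sh:id>"
    "<sh:book>d</sh:book><sh:genre>e</sh:genre><sh:id>f</sh:id>"
    "</sh:root>"
)
print(to_csv(sample))
```

Adding the declaration rather than stripping the prefixes keeps the element names intact, so nothing downstream that expects sh:book has to change.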

