Remove xml namespace prefixes before flattening

asked 2018-07-27 08:00:25 -0500

dcwatson84

updated 2018-08-08 11:53:11 -0500

I'd like to remove all XML namespace prefixes so that I don't have to reference them in my processors. However, the field renamer wants me to flatten the XML first, which would require me to reference the namespaces, defeating the purpose. Can I generically (using a regular expression) remove XML namespace prefixes from the entire nested field structure of the XML data?

Here's some raw XML...

<ODM ODMVersion="1.3" FileType="Snapshot" FileOID="ad998378" CreationDateTime="2016-06-03T14:59:52" xmlns="" >
    <AdminData studyOID="mystudyoid">
        <User OID="someid" UserType="Other">
            <DisplayName>Tracy R</DisplayName>
            <FullName>Tracy  R</FullName>
            <LocationRef LocationOID="10001" />

What I want is for that raw XML to be parsed and for the field names to look like this...

/ODM/AdminData/User/OID..... etc.

By default it looks something like below...

/ns1:ODM/ns1:AdminData/.... etc

I don't want those namespaces in the field names, and more importantly, I don't want to have to reference them, even in the removal process. Because "ns1" is something StreamSets is adding, for all I know that hard-coded string could change to "NSONE" in the next StreamSets version, and I can't build a pipeline with a dependency on an arbitrary hard-coded string.

So I want them removed generically. I wouldn't mind writing a regex, but that means I need to flatten the structure first, and I can't do that. I need this removal to happen while the data is still hierarchical.

FWIW - "Not possible" is a valid answer to this question.


That doesn't remove the namespace prefixes.

dcwatson84 (2018-07-27 16:52:44 -0500)

Right, but once the hierarchy is flat, you should be able to remove the namespace prefixes with the field renamer, using a regex.

metadaddy (2018-07-27 17:08:51 -0500)

If you can edit your question and add a small sample of the XML and the result you're looking for, I'll see if I can create a sample pipeline.

metadaddy (2018-07-27 17:10:11 -0500)

I updated the question with some XML. It shows what I'm talking about with the namespaces. Note the important part: I don't want to have to reference the namespace prefix "ns1". That string is not part of my XML; it's something StreamSets comes up with.

dcwatson84 (2018-08-08 10:36:44 -0500)

1 Answer

answered 2018-08-08 12:11:55 -0500

metadaddy

Data Collector prefixes the field names with ns1, ns2, etc. so that, if you have the same tag name in multiple namespaces, you don't lose that information. For an XML document with a single namespace, the elements will always be prefixed with ns1.

If you remove the namespace from the XML document, the fields will not be prefixed - you will see them as /ODM, /ODM/AdminData etc.
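As a rough illustration of removing the namespace from the document before it is parsed, here is a minimal Python sketch that strips the default `xmlns="..."` declaration from raw XML text with a regular expression. The function name `drop_default_namespace` and the placeholder URI `urn:example:odm` are hypothetical, chosen only for this example. The sketch assumes the document uses a default namespace only; prefixed declarations such as `xmlns:foo` are left untouched, since removing them would leave prefixed elements undeclared. It also treats the XML as plain text, so it would alter a matching string inside character data or a CDATA section.

```python
import re

# Matches a default-namespace declaration (xmlns="..." or xmlns='...')
# together with the whitespace in front of it.
_DEFAULT_NS = re.compile(r'''\s+xmlns\s*=\s*(?:"[^"]*"|'[^']*')''')

def drop_default_namespace(xml_text):
    """Return xml_text with every default xmlns declaration removed."""
    return _DEFAULT_NS.sub('', xml_text)

# "urn:example:odm" is a placeholder URI used for illustration only.
sample = '<ODM ODMVersion="1.3" xmlns="urn:example:odm" ><AdminData/></ODM>'
cleaned = drop_default_namespace(sample)
```

This step would have to run on the raw source data before Data Collector reads it; where that happens depends on how the files are produced.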

Another possibility is to use a script evaluator to traverse the field hierarchy and change the field names and record structure - see this answer on the sdc-user Google Group.
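The traversal idea can be sketched in plain Python as follows. This is not the code from the linked answer, and it does not use any Data Collector API: it assumes the record's value has already been exposed as nested dicts and lists, and the function name `strip_ns` is hypothetical. The regular expression removes whatever comes before the first colon in a key, so the prefix name ("ns1" or anything else) is never hard-coded.

```python
import re

# Matches a leading namespace prefix such as "ns1:" without naming it.
_PREFIX = re.compile(r'^[^:]+:')

def strip_ns(value):
    """Return a copy of a nested dict/list structure with prefixes removed from keys."""
    if isinstance(value, dict):
        # Rename each key, then recurse into its child value.
        return {_PREFIX.sub('', key): strip_ns(child) for key, child in value.items()}
    if isinstance(value, list):
        return [strip_ns(item) for item in value]
    # Leaf values (strings, numbers, None) are returned unchanged.
    return value

nested = {'ns1:ODM': {'ns1:AdminData': [{'ns1:User': {'ns1:DisplayName': 'Tracy R'}}]}}
renamed = strip_ns(nested)
```

One caveat with this approach: if the same tag name appears under two different prefixes at the same level, the renamed keys collide and one value overwrites the other, which is the information loss the prefixes exist to prevent.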

I just filed SDC-9729 to request this enhancement. Feel free to watch/vote/comment that issue.

I saw that removing the namespace declaration altogether was really the only easy way to do this. Unfortunately it means messing with the raw source data, but that's easier than having to reference "ns1:" everywhere. Thanks!

dcwatson84 (2018-08-08 18:23:38 -0500)