Parse XML from StackOverflow

asked 2018-01-02

I want to parse a Stack Overflow XML dump into, say, a .csv file. The format is very simple, with each row element holding all of its values in attributes:

  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="506" ViewCount="32399" Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;...  CommunityOwnedDate="2012-10-31T16:42:47.213" />

However, I have been unable to crack the code on parsing out the attributes using the XML parser. One caveat is that not all <row> elements have the same attributes, and I believe some of the row elements contain nested sub-elements. Maybe I am going about this wrong and should just use Jython. I have parsed the files using Spark, but would prefer to use SS. Does anybody have any pointers?

Thank you!

If my answer below doesn't point you in the right direction, please add a reference to the actual posts.xml you're working with and I can take a closer look.

metadaddy ( 2018-01-02 )

answered 2018-01-02

metadaddy

I created a sample Posts.xml file using the data in this Meta StackExchange answer:

<?xml version="1.0" encoding="utf-8"?>
  <row Id="1" PostTypeId="1" Tags="&lt;discussion&gt;&lt;scope&gt;&lt;homebrew&gt;" AnswerCount="3"/>
  <row Id="2" PostTypeId="2" ParentId="1"  />
  <row Id="3" PostTypeId="1" Tags="&lt;discussion&gt;&lt;scope&gt;" AnswerCount="2" CommentCount="0" />
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" Tags="&lt;discussion&gt;" AnswerCount="2" CommentCount="0" />
  <row Id="6" PostTypeId="2" ParentId="4" />
  <row Id="7" PostTypeId="2" ParentId="4" />

I used the Directory origin, with Delimiter Element set to row:

[image: Directory origin configuration with Delimiter Element set to row]

It parses the data just fine - here's the preview:

[image: preview of the parsed records]
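
For checking the expected CSV outside the pipeline, here is a minimal Python sketch of the same idea: treat every row element as one record and use the union of all attribute names as the columns, so rows that lack an attribute get an empty cell. This is not the StreamSets approach described above, only a standalone illustration. The function name rows_to_csv and the <posts> root element wrapped around the sample are assumptions made so the example is well-formed and runnable.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET

# Sample rows in the same shape as the file above. The <posts> root is an
# assumption added here so that the snippet is well-formed XML.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Tags="&lt;discussion&gt;&lt;scope&gt;&lt;homebrew&gt;" AnswerCount="3"/>
  <row Id="2" PostTypeId="2" ParentId="1" />
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" Tags="&lt;discussion&gt;" AnswerCount="2" CommentCount="0" />
</posts>
"""


def rows_to_csv(xml_source, csv_target):
    """Write one CSV line per <row>, with one column per attribute seen.

    xml_source is a filename or binary file object; csv_target is a text
    file object. Returns the column names in first-seen order.
    """
    rows = []
    columns = []
    # iterparse streams the document instead of loading it all at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        attrs = dict(elem.attrib)
        for name in attrs:
            if name not in columns:
                columns.append(name)
        rows.append(attrs)
        elem.clear()  # drop the element's contents once copied

    # restval fills in attributes that a given row does not have.
    writer = csv.DictWriter(csv_target, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns


if __name__ == "__main__":
    rows_to_csv(io.BytesIO(SAMPLE.encode("utf-8")), sys.stdout)
```

The sketch keeps every row's attributes in memory until the column list is known, which is fine for a sample but probably not for a full dump. With a fixed, known list of columns the rows could be written as they are read instead.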

