failing to read xml file

asked 2018-02-20 10:50:28 -0500

daryll-g

updated 2018-02-21 23:54:26 -0500

I've checked the questions I could find regarding XML reading issues on this forum. I removed the BOM from my test.xml file to eliminate that possibility: https://ask.streamsets.com/question/5...

For some reason, my source test.xml file is not being read.

I've verified that my file ownership is correct and that the file can be accessed by SS. I've also used the raw file preview and can confirm that the file is accessible.

Any help is greatly appreciated!

The screenshots are: my directory source setup (1+2) and the preview data showing "no preview records".

Screenshots of my pipeline: source setting, source setting, XML flattener


Comments

It sounds like there are no errors, right - just no records? Can you give an outline of the XML file structure?

metadaddy ( 2018-02-20 14:07:45 -0500 )

<family> <lastname>Groenewald</lastname> <father>James William Chalmers</father> <mother>Lucille Athelia</mother> <sibling1>Shaun</sibling1> <sibling2>Marc</sibling2> <sibling3>Daryll</sibling3> </family>

daryll-g ( 2018-02-20 23:45:28 -0500 )

File attached. Rename it to .xml after downloading, since the forum does not let me upload XML files: [C:\fakepath\family(rename to xml).png](/upfiles/15191920142295838.png)

daryll-g ( 2018-02-20 23:48:54 -0500 )

I've also been able to upload my 3 screenshots (something I was not able to do yesterday, probably because it was my first login). Please see them in the original post.

daryll-g ( 2018-02-21 00:17:25 -0500 )

1 Answer


answered 2018-02-21 15:35:49 -0500

metadaddy

It's the byte order mark (BOM) again. Looking at the file you uploaded with hexdump:

$ hexdump -C ~/family.xml
00000000  ef bb bf 3c 3f 78 6d 6c  20 76 65 72 73 69 6f 6e  |...<?xml version|
00000010  3d 22 31 2e 30 22 20 65  6e 63 6f 64 69 6e 67 3d  |="1.0" encoding=|
00000020  22 75 74 66 2d 38 22 3f  3e 0d 0a 3c 46 61 6d 69  |"utf-8"?>..<Fami|
00000030  6c 79 3e 0d 0a 09 3c 4c  61 73 74 4e 61 6d 65 3e  |ly>...<LastName>|
00000040  47 72 6f 65 6e 65 77 61  6c 64 3c 2f 4c 61 73 74  |Groenewald</Last|
00000050  4e 61 6d 65 3e 0d 0a 09  3c 46 61 74 68 65 72 3e  |Name>...<Father>|
00000060  4a 61 6d 65 73 20 57 69  6c 6c 69 61 6d 20 43 68  |James William Ch|
00000070  61 6c 6d 65 72 73 3c 2f  46 61 74 68 65 72 3e 0d  |almers</Father>.|
00000080  0a 09 3c 4d 6f 74 68 65  72 3e 4c 75 63 69 6c 6c  |..<Mother>Lucill|
00000090  65 20 41 74 68 65 6c 69  61 3c 2f 4d 6f 74 68 65  |e Athelia</Mothe|
000000a0  72 3e 0d 0a 09 3c 53 69  62 6c 69 6e 67 31 3e 53  |r>...<Sibling1>S|
000000b0  68 61 75 6e 3c 2f 53 69  62 6c 69 6e 67 31 3e 0d  |haun</Sibling1>.|
000000c0  0a 09 3c 53 69 62 6c 69  6e 67 32 3e 4d 61 72 63  |..<Sibling2>Marc|
000000d0  3c 2f 53 69 62 6c 69 6e  67 32 3e 0d 0a 09 3c 53  |</Sibling2>...<S|
000000e0  69 62 6c 69 6e 67 33 3e  44 61 72 79 6c 6c 3c 2f  |ibling3>Daryll</|
000000f0  53 69 62 6c 69 6e 67 33  3e 0d 0a 3c 2f 46 61 6d  |Sibling3>..</Fam|
00000100  69 6c 79 3e                                       |ily>|

Those first three bytes are the BOM. It's easy to remove with tail:

$ tail -c +4 ~/family.xml > ~/family2.xml

The modified file:

$ hexdump -C ~/family2.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 0d 0a  3c 46 61 6d 69 6c 79 3e  |f-8"?>..<Family>|
00000030  0d 0a 09 3c 4c 61 73 74  4e 61 6d 65 3e 47 72 6f  |...<LastName>Gro|
00000040  65 6e 65 77 61 6c 64 3c  2f 4c 61 73 74 4e 61 6d  |enewald</LastNam|
00000050  65 3e 0d 0a 09 3c 46 61  74 68 65 72 3e 4a 61 6d  |e>...<Father>Jam|
00000060  65 73 20 57 69 6c 6c 69  61 6d 20 43 68 61 6c 6d  |es William Chalm|
00000070  65 ...
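To check only the leading bytes instead of reading a full dump, something like the following could work. This is a minimal sketch, not part of the original answer: the function name `has_bom` is made up for illustration, and it assumes a UTF-8 file and a POSIX `od`.

```shell
#!/bin/sh
# has_bom FILE: exit 0 if FILE starts with the UTF-8 BOM (ef bb bf), else non-zero.
# The name has_bom is hypothetical; it is not a standard tool.
has_bom() {
    # Dump only the first three bytes as hex, then strip all whitespace
    first=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
    [ "$first" = "efbbbf" ]
}
```

Usage might look like `has_bom ~/family.xml && echo "BOM present"`.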

Comments

Thanks @metadaddy. I actually looked at your post on the BOM: https://ask.streamsets.com/question/517/content-is-not-allowed-in-prolog-error-parsing-xml/ I misunderstood that post: I took it to mean that the whole 1st line was the problem.

daryll-g ( 2018-02-21 23:45:24 -0500 )

cont: After reading that post, I removed the whole 1st line with vi but still had the same outcome. Will the answer from question 517 mentioned above also work on files that do not have a BOM? That is: if a BOM exists, remove it and move the file to the new location; otherwise, move the file to the new location unedited?

daryll-g ( 2018-02-21 23:49:38 -0500 )

That pipeline just replaces the first line of the file with the standard XML prolog, so the file would technically be edited, but it should be the same as it was.

metadaddy ( 2018-02-22 13:44:07 -0500 )
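For the conditional handling asked about above (remove the BOM if it exists, otherwise leave the file unedited), a shell sketch run outside the pipeline might look like this. It is only an illustration under assumptions: the function name `copy_without_bom` and its two arguments are hypothetical, the file is assumed to be UTF-8, and it skips the same three bytes as the tail command in the answer.

```shell
#!/bin/sh
# copy_without_bom SRC DST: copy SRC to DST, dropping a leading UTF-8 BOM if present.
# Function name and arguments are hypothetical, chosen for this sketch.
copy_without_bom() {
    src=$1
    dst=$2
    # First three bytes as hex with whitespace removed, e.g. "efbbbf" for a BOM
    first=$(od -An -tx1 -N3 "$src" | tr -d '[:space:]')
    if [ "$first" = "efbbbf" ]; then
        # Start output at byte 4, skipping the 3-byte BOM
        tail -c +4 "$src" > "$dst"
    else
        # No BOM: copy the file unchanged
        cp "$src" "$dst"
    fi
}
```

It could be called as `copy_without_bom ~/family.xml ~/family2.xml`, with the destination set to the directory the pipeline reads from. Unlike the pipeline from question 517, this leaves a file without a BOM byte-for-byte identical.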

Be careful: vi is likely being clever and saving the file with the BOM. Use hexdump to see what's really happening!

metadaddy ( 2018-02-22 13:44:58 -0500 )