Ask Your Question
0

What are some common things that could be causing CSV parsing errors during data movements?

asked 2018-01-31 20:31:22 -0500

lampshadesdrifter gravatar image

updated 2018-02-09 18:13:11 -0500

Using streamsets to move TSV (with header) data between mapr FS locations (in batch cluster mode using hadoop impersonation, if any of that is relevant) and after the pipeline runs for a bit, the pipeline fails and tries to restart continually failing at this point (ie. some records run through successfully until it seems to hit some problem). Looking at the logs, I see the error:

....
Diagnostics : Task failed task_1517001433399_0014_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0

2018-01-26 15:52:52,404 ingest2sa2tenant_demodata_batch_002/ingest2sa2tenantdemodatabatch002e19a9d7a-14a2-4c5e-b4a6-965dd34c43ee    ERROR   Error in Slave Runner:  ClusterRunner   *admin      runner-pool-2-thread-23

java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:530)
at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:540)
....

Looking at the error message, it seems that the parser is having trouble with line 2 of one of the tsv files beng processed (there are 4 separate files in the origin location). Redacting some information the line-2 records from each of the tsv files are shown below:

XXXXXX      Puncture wound with foreign body of left thumb without damage to nail   null    INJURIES TO THE WRIST, HAND AND FINGERS null    XXXXXX  Y   Y   null    null    null    null    null    null    null    null    null    null    2   null    null    null    null    null    null    N   N   null    null    null    null    null    null    null    S61.042 2   null    S61.042 XXXX    null    null    null    2017-10-06 00:02:49 XXX 2018-02-05 16:08:59

XXXXXXXXXXX Complex care coordination   null    null    null    null    N   N   null    null    null    null    null    null    null    null    null    null    1   null    null    null    null    null    null    null    null    null    null    null    null    null    null    2   null    null    V65.49  Z71.89  XXXX    null    null    null    2017-10-06 08:32:47 null    2018-02-05 16:08:59

XXXXXX      Unspecified occupant of three-wheeled motor vehicle injured in collision with heavy transport vehicle or bus in traffic accident, initial encounter null    OCCUPANT OF THREE-WHEELED MOTOR VEHICLE INJURED IN TRANSPORT ACCIDENT   null    XXXXXX  Y   Y   null    Motor vehicle collision victim  null    null    null    null    null    null    null    null    2   null    null    null    null    null    null    N   N   null    null    null    null    null    null    null    V34.9XXA    2   null    V34.9XXA    XXXXXX  XXXXXXX null    null    2017-10-05 23:48:49 300 2018-02-05 16:08:59

XXXXXXX     Wedge compression fracture of unspecified lumbar vertebra, subsequent encounter for fracture with nonunion  null    null    null    null    N   null    null    Compression fracture of lumbar vertebra null    null    null    null    null    null    null    null    1   null    null    null    null    null    null    null    null    null    null    null    null    null    null    2   null    null    733.82  S32.000K    XXXXXX  null    null    null    2017-10-06 06:27:38 null    2018-02-05 16:08:59

all line 2's fromm tsv files as displayed using sqlline drill query

| column_0  |                                column_1                                | column_2  |                 column_3                 | column_4  | column_5  | column_6  | column_7  | column_8  | column_9  | column_10  | column_11  | column_12  | column_13  | column_14  | column_15  | column_16  | column_17  | column_18  | column_19  | column_20  | column_21  | column_22  | column_23  | column_24  | column_25  | column_26 ...
(more)
edit retag flag offensive close merge delete

Comments

The wc -m output is showing the total number of characters in the file. As long as each particular line is less than 100,000 characters, that should not be an issue.

jeff gravatar imagejeff ( 2018-02-07 15:01:40 -0500 )edit

Just for testing purposes, can you try changing away from Tab Separated Values to Custom? And specify the delimiter character as a tab there (\u0009)? Then either specify the quote and escape chars or dummy values for those (assuming you are not using quoted field values).

jeff gravatar imagejeff ( 2018-02-07 15:06:49 -0500 )edit

@jeff Specifying Custom "Delimiter type" to Tab and setting Escape Charcter to "other"(\) and Quote Character to "other"("), I am still seeing: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter

lampshadesdrifter gravatar imagelampshadesdrifter ( 2018-02-07 17:02:49 -0500 )edit

1 Answer

Sort by ยป oldest newest most voted
0

answered 2018-02-01 16:32:41 -0500

jeff gravatar image

The error message is: invalid char between encapsulated token and delimiter. Basically, the data is malformed with respect to how you have configured the parser. It seems to be on line number 2. Can you please post a snippet of your source data file, as well as the configuring you are using for the delimited data in the pipeline?

edit flag offensive delete link more

Comments

Was not able to fit extra info in comment section, but have updated my original post with some details about the data and configuration. Thanks.

lampshadesdrifter gravatar imagelampshadesdrifter ( 2018-02-01 19:06:23 -0500 )edit

I understand that you have private data you can't share. However, can you anonymize it and share? Even replacing the actual cell values with random characters should be fine. Unfortunately it's the only way to identify what is malformed in the data.

jeff gravatar imagejeff ( 2018-02-01 21:37:23 -0500 )edit

I have added the cleaned data records from line#2 of all of the origin data files. The formatting is not great as I could not figure out how to make side-scrolling text boxes to fit the long data strings.

lampshadesdrifter gravatar imagelampshadesdrifter ( 2018-02-06 19:20:48 -0500 )edit
Login/Signup to Answer

Question Tools

1 follower

Stats

Asked: 2018-01-31 20:31:22 -0500

Seen: 260 times

Last updated: Feb 09