
Removing special characters from field names in Data Collector Pipeline

asked 2019-11-07 05:15:50 -0600

lutz.kuenneke

updated 2019-11-07 10:58:51 -0600

metadaddy


We are running StreamSets Data Collector 3.11.0 and using the JDBC Multitable Consumer to read some tables from an MS SQL database. SDC is running on CentOS 7 with around 12 GB of Java heap space.

The origin has column names with German umlauts (ä, ö, ü) and brackets. We want to remove those in StreamSets because they cause issues later on.

I have chained 8 Field Renamer processors, each configured along these lines:
Source Field Expression: /'(.)[Ä](.)'
Target Field Expression: /$1AE$2

This approach produces the correct result but kills performance. At times throughput drops below 1 record / second. Without the Field Renamers, performance is around 1000 rows / second.

Is there a more efficient way to do this?

Thank you and best regards


2 Answers


answered 2019-11-08 12:19:41 -0600

jeff

The Field Mapper processor provides a fairly performant way to accomplish this. See here.


answered 2019-11-07 10:58:35 -0600

metadaddy

I can think of a couple of alternative ways to do this:

  • Use one of the script evaluators (Groovy tends to have the best performance) and do the same transformations in script. The advantage here is that you can scan through the field names on each record just once, rather than again and again. The evaluators can also store state, so you could even build a map of input to output field names.
  • Write a custom processor to do the same thing in Java. This will give you the best performance, but you need to build a jar and load it into Data Collector.
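
To make the "scan once and keep a map" idea concrete, here is a rough sketch of the renaming logic in plain Java. It deliberately does not use the Data Collector API: the class name `FieldNameSanitizer`, the method names, and the use of a plain `Map` as a stand-in for a record are all assumptions for illustration. Only the Ä to AE style of replacement comes from the question; the exact set of bracket characters to drop is a guess. The same logic could be ported to a Groovy script or wrapped in a custom processor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of any StreamSets API.
public class FieldNameSanitizer {

    // Cache of original name -> cleaned name, so each distinct column name
    // is transformed only once, no matter how many records pass through.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Returns the cleaned name, computing it on first sight only.
    public static String sanitize(String name) {
        return CACHE.computeIfAbsent(name, FieldNameSanitizer::clean);
    }

    // Single pass over the characters: transliterate umlauts, drop brackets.
    private static String clean(String name) {
        StringBuilder out = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            switch (c) {
                case '\u00C4': out.append("AE"); break; // capital A-umlaut
                case '\u00D6': out.append("OE"); break; // capital O-umlaut
                case '\u00DC': out.append("UE"); break; // capital U-umlaut
                case '\u00E4': out.append("ae"); break; // small a-umlaut
                case '\u00F6': out.append("oe"); break; // small o-umlaut
                case '\u00FC': out.append("ue"); break; // small u-umlaut
                case '[': case ']': case '(': case ')':
                    break; // assumption: brackets are simply removed
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Stand-in for a record: rename every top-level field in one loop.
    public static Map<String, Object> renameFields(Map<String, Object> record) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            renamed.put(sanitize(e.getKey()), e.getValue());
        }
        return renamed;
    }
}
```

Note that if two different column names clean up to the same result, the later one overwrites the earlier one in this sketch, so you may want to check for collisions.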
