
How do I skip records whose fields don't match my pattern in a StreamSets pipeline?

asked 2020-03-03 13:12:37 -0500

tommy24b

updated 2020-03-03 16:56:52 -0500

metadaddy

I'm using StreamSets to load an F5 log file, which is updated continuously, into a database. I'm running into an error where the fields are mismatched with the fields I specified in the JDBC Producer.

Below is a sample line read by my File Tail origin:

34825694275,2020-02-28T04:49:35-08:00,110.10.100.17,tm[2311]: sol_key=HTTPLog,v_s=/cvgdge/hsksws/fks-vs,s_a=118.21.2.2024,d_a=13.2.4.42,h_s=100,h_p_l=109,d_s=4,h_m=GET,h_p=peoples.welfare.ca,h_i=/tuis/fhdjk/er/login.se,h_q="",h_ref="https://yahoo.com",http_useragent="Mozilla/5.0 "

All the values are comma separated, which lets me use an Expression Evaluator to split them into text fields and match them to columns in the JDBC Producer as required.

Sometimes the log file contains multiple addresses for s_a= and d_a=. When that happens, the SDC pipeline breaks with the following error, which makes sense because the text fields no longer match the JDBC columns:

Pipeline Status: RUNNING_ERROR: For input string: "0."
java.lang.NumberFormatException: For input string: "0."
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at oracle.sql.NUMBER.toBytes(NUMBER.java:1904)
at oracle.sql.NUMBER.<init>(NUMBER.java:287)
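
To illustrate why the extra address causes this kind of failure, here is a minimal Python sketch (not StreamSets code) that splits a shortened version of the sample line on commas, once as-is and once with a second s_a address. The names `NORMAL`, `TWO_ADDRESSES` and `split_fields` are made up for this illustration. Every field after s_a moves one position to the right, so a text value can end up where a numeric column is expected.

```python
# Illustration only: a shortened copy of the sample line from the question.
NORMAL = (
    "34825694275,2020-02-28T04:49:35-08:00,110.10.100.17,"
    "tm[2311]: sol_key=HTTPLog,v_s=/cvgdge/hsksws/fks-vs,"
    "s_a=118.21.2.2024,d_a=13.2.4.42,h_s=100,h_p_l=109"
)

# The same line, but with two addresses in s_a (the failing case).
TWO_ADDRESSES = NORMAL.replace(
    "s_a=118.21.2.2024", "s_a=118.21.2.2024,116.22.3.2312"
)


def split_fields(line):
    """Split a log line on commas, the way a plain comma split would."""
    return line.split(",")


normal_fields = split_fields(NORMAL)
shifted_fields = split_fields(TWO_ADDRESSES)

# One extra field appears, and position 7 no longer holds the numeric h_s value.
print(len(normal_fields), len(shifted_fields))
print(normal_fields[7], "|", shifted_fields[7])
```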

I'm using File Tail as the origin with the Text data format. I use a Field Replacer to get rid of sol_key=, s_a=, d_a= and so on. After the Field Replacer, the line looks like:

34825694275,2020-02-28T04:49:35-08:00,110.10.100.17,tm[2311]: HTTPLog,/cvgdge/hsksws/fks-vs,118.21.2.2024,13.2.4.42,100,109,4,GET,peoples.welfare.ca,/tuis/fhdjk/er/login.se,"","https://yahoo.com","Mozilla/5.0"

I'm then using an Expression Evaluator, which gives me:

Record :
MAP
  text :
  LIST [ 17 ]
    0 : STRING 34825694275
    1 : STRING 2020-02-28T04:49:35-08:00
    2 : STRING 110.10.100.17
    3 : STRING tm[2311]:
    4 : STRING HTTPLog
    5 : STRING /cvgdge/hsksws/fks-vs
    6 : STRING 118.21.2.2024
    7 : STRING 13.2.4.42
    8 : STRING 100
    9 : STRING 109
    10 : STRING 4
    11 : STRING GET
    12 : STRING peoples.welfare.ca,/tuis/fhdjk/er/login.se
    13 : STRING ""
    14 : STRING ""
    15 : STRING "https://yahoo.com",
    16 : STRING "Mozilla/5.0"

I'm using another replacer to get rid of the double quotes and tm:, as per my requirement, so that I can load the result into the database using the JDBC Producer.

The issue is that sometimes s_a has two IPs in the File Tail input, for example s_a=118.21.2.2024,116.22.3.2312, and that breaks the pipeline: the record ends up with 18 text fields, which no longer match the JDBC Producer. I want to skip a record if it has two IP addresses for s_a or, more generally, if the number of fields exceeds what I specified in the JDBC Producer, and send it to error.
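
The field-count check described here can be sketched in plain Python as follows. This is only an illustration of the logic, not a StreamSets stage. The tokenizer (split on commas or whitespace) is an assumption chosen because it reproduces the 17 and 18 field counts mentioned above, and `tokenize`, `partition` and `EXPECTED_FIELDS` are hypothetical names.

```python
import re

EXPECTED_FIELDS = 17  # count reported in the question for a well-formed line

# The cleaned-up sample line from the question (after the Field Replacer).
GOOD = (
    '34825694275,2020-02-28T04:49:35-08:00,110.10.100.17,tm[2311]: HTTPLog,'
    '/cvgdge/hsksws/fks-vs,118.21.2.2024,13.2.4.42,100,109,4,GET,'
    'peoples.welfare.ca,/tuis/fhdjk/er/login.se,"","https://yahoo.com","Mozilla/5.0"'
)
# The failing case: a second source address right after the first one.
BAD = GOOD.replace("118.21.2.2024,", "118.21.2.2024,116.22.3.2312,")


def tokenize(line):
    """Assumed tokenizer: split on commas or runs of whitespace."""
    return re.split(r",|\s+", line.strip())


def partition(lines, expected=EXPECTED_FIELDS):
    """Separate lines with the expected field count from everything else."""
    good, errors = [], []
    for line in lines:
        fields = tokenize(line)
        if len(fields) == expected:
            good.append(fields)
        else:
            # Keep the raw line so it can be inspected later.
            errors.append(line)
    return good, errors


good_records, error_lines = partition([GOOD, BAD])
print(len(good_records), len(error_lines))
```

Keeping the raw line in the error list, rather than the split fields, preserves the original addresses for troubleshooting.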

How can this be done? I'm using the community version, streamsets-datacollector-3.13.0.

I want to use another Expression Evaluator and send the IPs to error for troubleshooting later ...


1 Answer


answered 2020-03-03 17:04:24 -0500

metadaddy

As I mentioned in my answer to a similar question, the key here is to use a Grok pattern to assign names to the fields. Once you've done that, you can use a Stream Selector to send records that contain a comma in the IP address field along a different path, or a precondition to send them to the error stream.
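
As a rough illustration of the idea (named fields first, then a routing condition on the address field), here is a Python sketch. It is not a Grok pattern and not StreamSets code: it uses a regular expression to pick out the key=value pairs, and `parse_named_fields`, `route` and the shortened sample lines are all hypothetical. It assumes values never contain text that itself looks like `,key=`.

```python
import re

# A key is a word followed by "=", at the start of the text or after a comma/space.
KEY = re.compile(r"(?:^|[,\s])(\w+)=")

# Shortened sample lines based on the question (illustration only).
ONE_ADDRESS = (
    "110.10.100.17,tm[2311]: sol_key=HTTPLog,v_s=/cvgdge/hsksws/fks-vs,"
    "s_a=118.21.2.2024,d_a=13.2.4.42,h_s=100"
)
TWO_ADDRESSES = ONE_ADDRESS.replace(
    "s_a=118.21.2.2024", "s_a=118.21.2.2024,116.22.3.2312"
)


def parse_named_fields(line):
    """Return a dict of key -> value; a value runs until the next key starts."""
    matches = list(KEY.finditer(line))
    record = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        record[match.group(1)] = line[match.end():end]
    return record


def route(record):
    """Send records with a comma in an address field to 'error', else 'jdbc'."""
    for name in ("s_a", "d_a"):
        if "," in record.get(name, ""):
            return "error"
    return "jdbc"


print(route(parse_named_fields(ONE_ADDRESS)))
print(route(parse_named_fields(TWO_ADDRESSES)))
```

Because the fields are addressed by name, a second address stays inside s_a instead of shifting the later fields, and the condition only has to look for a comma in that one value.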

The StreamSets Data Collector tutorial has an example of using a Stream Selector.
