How do I parse IP addresses correctly from a quoted list?

asked 2020-03-02 12:26:35 -0500

anonymous user


updated 2020-03-02 14:03:24 -0500

metadaddy

I'm using StreamSets to load a continuously updated file into a database. I'm running into an error where the fields don't match the fields I specified in the JDBC Producer.

Below is a sample line from my file tail:

date=202002228, address="", duration=10ms,

In my pipeline, I'm using this in Field Replacer1:

  • /text ${str:replaceAll(record:value('/text'),"address=",',')} ==> this removes the address expression
  • /text ${str:split(record:value('/text'), ',')} ==> this splits the line into fields

And this in Field Replacer2:

  • /text[2] ${str:replaceAll(record:value('/text[2]'),""",'')} ==> this takes care of the double quotes

The issue is that the address field sometimes contains two IPs, and that breaks the pipeline. How do I skip the address field if it has two IP addresses, as in the following?

date=202002228, address=",", duration=10ms,

I'm using the community version, streamsets-datacollector-all-3.13.0. Please advise if this can be done in a better way. I'm using Text as the data format.

I need to send only one value from address= to the database: the first one. How can this be achieved?

Can I use if-then-else expressions when I get two addresses in my file tail?

output field= /text
field expression=${record:value('/text')}

How can I refer to the "address" inside /text so that, if there are two addresses in the address field, an if-then-else expression picks one and processes it?

I mean, you can always have an expression that looks for a comma character (`,`) in the text value, and sends down a different path if found. Assuming the "multiple addresses" always shows up in this same manner.

jeff ( 2020-03-02 13:21:10 -0500 )

My file tail lines are already separated by commas. Initially the text value in /text is date=202002228, address=,, duration=10ms,

tommy24b ( 2020-03-02 13:30:59 -0500 )

1 Answer

answered 2020-03-02 14:00:24 -0500

metadaddy

Your problem is that your string replacement can't tell that the comma inside the quotes is different from the commas separating fields. You should take a look at Grok patterns to parse the fields out of the data. Set the Data Format in your origin to 'Log', and set the Log Format to 'Grok Pattern'. Now you can use settings like this to parse out the fields:

  • Grok Pattern Definition: MYDATE %{YEAR}%{MONTHNUM2}%{MONTHDAY}
  • Grok Pattern: %{MYDATE:date}, address=%{QS:address}, duration=%{BASE10NUM:duration}ms, url=%{HOSTNAME:url}

This will set the /address field to something like "" or ",". You can then use a series of expressions to extract the first IP address:

  • /address ${str:replace(record:value('/address'),'"','')} ==> Remove the enclosing quotes
  • /address ${str:split(record:value('/address'),", ")} ==> Split on 'comma space'
  • /address ${record:value('/address[0]')} ==> Just use the first element of the list
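
For anyone who wants to check the logic outside the pipeline, here is a rough Python sketch of the same three steps. It is not StreamSets code: the sample line and its IPs are made-up placeholders, `first_address` is a hypothetical helper, and the regex is only a loose stand-in for the QS capture in the Grok pattern. It also splits on a bare comma and trims whitespace, so it tolerates both "," and ", " as the separator.

```python
import re

# Hypothetical sample line; the IPs are placeholder documentation addresses,
# not values taken from the original log.
LINE = 'date=20200228, address="192.0.2.10, 198.51.100.7", duration=10ms,'

# Loose stand-in for the Grok QS capture: the quotes plus everything between
# them, so a comma inside the quotes stays within a single field.
ADDRESS_RE = re.compile(r'address=("[^"]*")')


def first_address(line):
    """Return the first entry of the quoted address list, or None if absent."""
    match = ADDRESS_RE.search(line)
    if match is None:
        return None
    quoted = match.group(1)
    unquoted = quoted.replace('"', '')  # step 1: remove the enclosing quotes
    parts = unquoted.split(',')         # step 2: split on the comma
    return parts[0].strip()             # step 3: keep only the first element


# A plain comma split cannot tell the two kinds of comma apart, so the
# quoted list is cut in half and every later field shifts by one.
naive_fields = LINE.split(',')

print(naive_fields)
print(first_address(LINE))
```

The plain split yields one more piece than the line has fields, which is the mismatch the JDBC Producer complains about; capturing the quoted string first avoids that.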



I'm assuming that 202002228 is a typo - there are too many characters for it to be a valid date!

The reason I'm using File Tail: "The files to be processed must all share a file name pattern and be fully written. To read data from an active file that is still being written to, use the File Tail origin."

tommy24b ( 2020-03-02 14:07:29 -0500 )

Does the Log format support the same functionality as File Tail? Yes, there is a typo in 202002228; it should be 2020228.

tommy24b ( 2020-03-02 14:08:08 -0500 )

File Tail works with the Log data format. Just select it on the Data Format tab instead of Text.

metadaddy ( 2020-03-02 14:09:14 -0500 )

Is there any way I can send the file to error if I run into more than two IPs in the address field?

tommy24b ( 2020-03-02 15:22:46 -0500 )

Now I have destination_address=,http_method=GET in the log file, and I'm not able to get a Grok pattern that matches the destination address.

tommy24b ( 2020-03-02 23:54:01 -0500 )
