HTML removal in pipeline

asked 2018-12-11 16:02:32 -0500

anonymous user

I am importing JSON data from a REST API into a flat CSV file, and a few of the columns contain HTML. The HTML is messing up the order of the columns. I would like to know whether StreamSets can remove HTML tags from inside a field.

I looked at the Expression Evaluator and expression language documentation but couldn't connect the dots.

https://streamsets.com/documentation/...

https://streamsets.com/documentation/...

Comments

Can you post an example of the data, and how you want it to look?

metadaddy ( 2018-12-11 18:44:24 -0500 )

Below is the requested update (sample data). The output is a CSV file. The description field contains HTML in the source data, and I would like to use a StreamSets pipeline to remove it.

SAMPLE JSON DATA: https://pastebin.com/hk883tH0

OUTPUT: https://pastebin.com/992sFQtz

Somniac ( 2018-12-12 09:32:14 -0500 )

1 Answer

answered 2018-12-12 12:56:07 -0500

iamontheinet

updated 2018-12-12 12:59:51 -0500

Hi!

Your best bet for removing all HTML tags from a given input field is to use the Jython Evaluator along with Beautiful Soup.

Here's how:

Step 1: Install Beautiful Soup for Python 2.7 by executing pip2 install beautifulsoup4 on the machine that runs Data Collector (your local machine, if you run it locally). If all goes well, it will typically be installed in the default location, /usr/local/lib/python2.7/site-packages/.

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

import sys
# Make the pip-installed packages from Step 1 visible to Jython
sys.path.append('/usr/local/lib/python2.7/site-packages/')
from bs4 import BeautifulSoup

for record in records:
  try:
    # Parse the HTML in record.value['description']
    desc_html_object = BeautifulSoup(record.value['description'], 'html.parser')

    # Strip all HTML tags and store the text in a new field 'text'
    record.value['text'] = desc_html_object.get_text()

    # Write the record to the output
    output.write(record)

  except Exception as e:
    # Send the record to error along with the exception message
    error.write(record, str(e))
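
The directory passed to sys.path.append above depends on where pip actually installed the package, which can differ between systems. As a rough sketch, the following can be run with the same Python 2.7 interpreter used in Step 1 to print the directory to use; find_package_dir is a hypothetical helper name, not part of StreamSets or Beautiful Soup.

```python
import importlib
import os


def find_package_dir(name):
    """Return the directory that contains package `name`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # For a package, __file__ is <dir>/<name>/__init__.py, so go up two levels.
    return os.path.dirname(os.path.dirname(os.path.abspath(module.__file__)))


# Prints the site-packages directory for bs4, or None if bs4 is not installed.
print(find_package_dir('bs4'))
```

If this prints None, the install in Step 1 did not succeed for that interpreter.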

Running the pipeline adds a new field, text, to each record, containing the description with its HTML tags removed.
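
If installing Beautiful Soup on the Data Collector machine is not an option, a different technique is to strip tags with the standard library's HTMLParser instead. The sketch below is not the approach from this answer, and TagStripper and strip_tags are hypothetical names; it only illustrates the idea and can be tried outside a pipeline. On Python 2.7 / Jython, entity references such as &amp; are delivered to separate handlers and are dropped by this sketch, whereas Python 3 converts them to text by default.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7 / Jython 2.7


class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for text content; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


print(strip_tags('<p>Hello <b>world</b></p>'))
```

Inside the Jython Evaluator script, the assumption is that record.value['text'] = strip_tags(record.value['description']) would take the place of the two Beautiful Soup lines.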

Cheers, Dash

Comments

Awesome, man, appreciate the help!

Somniac ( 2018-12-12 13:01:26 -0500 )

You're welcome!

iamontheinet ( 2018-12-12 13:33:00 -0500 )