HTML removal in pipeline

asked 2018-12-11 16:02:32 -0500

anonymous user


I am importing JSON data from a REST API into a flat CSV file, and a few of the columns contain HTML. The HTML is messing up the order of the columns. Can StreamSets remove HTML tags from inside a field?

I looked at the Expression Evaluator and the expression language documentation but couldn't connect the dots.



Can you post an example of the data, and how you want it to look?

metadaddy (2018-12-11 18:44:24 -0500)

Below is the requested update (sample data). The output is a CSV file. The source data has a description field that contains HTML, and I would like to use a StreamSets pipeline to remove the HTML. [SAMPLE JSON DATA] [OUTPUT]

Somniac (2018-12-12 09:32:14 -0500)

1 Answer

answered 2018-12-12 12:56:07 -0500

iamontheinet

updated 2018-12-12 12:59:51 -0500


Your best bet for removing all HTML tags from a given input field is to use the Jython Evaluator along with Beautiful Soup.

Here's how:

Step 1: Install Beautiful Soup for Python 2.7 by executing pip2 install beautifulsoup4 on the machine that runs Data Collector (your local machine in this example). If all goes well, it will be installed in the default location /usr/local/lib/python2.7/site-packages/.

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

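import sys
# Make the pip-installed packages visible to Jython. This assumes the default
# install location from Step 1; adjust the path if yours differs.
sys.path.append('/usr/local/lib/python2.7/site-packages')
from bs4 import BeautifulSoup

for record in records:
  try:
    # Parse the HTML in record.value['description']
    desc_html_object = BeautifulSoup(record.value['description'], 'html.parser')

    # Strip all HTML tags and store the text in a new field 'text'
    record.value['text'] = desc_html_object.get_text()

    # Write record to the processor output
    output.write(record)

  except Exception as e:
    # Send record to error
    error.write(record, str(e))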

Running the pipeline adds a text field to each record, holding the description with its HTML tags removed:

[image]
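
If installing Beautiful Soup on the Data Collector host is not an option, the same idea can be sketched with the standard library alone. This is not the approach from the answer above but a swapped-in alternative: TagStripper and strip_tags are hypothetical names, and the sample string is made up. It only collects text nodes, so treat it as a starting point rather than a drop-in replacement.

```python
# Standard-library sketch for stripping HTML tags (an alternative to
# Beautiful Soup, not the method used in the answer above).
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7 / Jython


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical name)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped
        self.chunks.append(data)


def strip_tags(html):
    """Return html with all tags removed (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


# Made-up sample value standing in for the 'description' field
sample = '<p>Hello <b>world</b></p>'
print(strip_tags(sample))
```

Inside the Jython Evaluator, the loop body would then assign strip_tags(record.value['description']) to record.value['text']. One caveat: on Python 2.7, entities such as &amp; are reported through separate handler methods and are not passed to handle_data, so they would need extra handling there.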

Cheers, Dash


Awesome, man. Appreciate the help!

Somniac (2018-12-12 13:01:26 -0500)

You're welcome!

iamontheinet (2018-12-12 13:33:00 -0500)