HTML removal in pipeline

asked 2018-12-11

anonymous user


I am importing JSON data from a REST API into a flat CSV file, and a few of the columns contain HTML. The HTML is messing up the order of the columns. I would like to know whether StreamSets can remove HTML tags inside a field.

I looked at the Expression Evaluator and the expression language documentation but couldn't connect the dots.

Can you post an example of the data, and how you want it to look?

metadaddy ( 2018-12-11 )

Below is the update that was requested (sample data). The output is a CSV file. The description field in the source data contains HTML, and I would like to use a StreamSets pipeline to remove it.

SAMPLE JSON DATA

OUTPUT

Somniac ( 2018-12-12 )

answered 2018-12-12

iamontheinet

updated 2018-12-12 12:59:51 -0500


Your best bet for removing all HTML tags from a given input field is to use the Jython Evaluator along with Beautiful Soup.

Here's how:

Step 1: Install Beautiful Soup for Python 2.7 by executing pip2 install beautifulsoup4 on your local machine. If all goes well, it will get installed in the default location /usr/local/lib/python2.7/site-packages/.

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

import sys
# Assumption: Beautiful Soup is in the default location from Step 1
sys.path.append('/usr/local/lib/python2.7/site-packages')
from bs4 import BeautifulSoup

for record in records:
    try:
        # Parse the HTML in record.value['description']
        desc_html_object = BeautifulSoup(record.value['description'], 'html.parser')

        # Strip all HTML tags and store the text in a new field 'text'
        record.value['text'] = desc_html_object.get_text()

        # Write the record to the processor output
        output.write(record)

    except Exception as e:
        # Send the record to error
        error.write(record, str(e))

Running the pipeline adds a new text field to each record, containing the description with all HTML tags removed.
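
To try the tag-stripping idea outside a pipeline, here is a minimal stand-alone sketch. It assumes plain Python 3 and swaps Beautiful Soup for the standard library's html.parser, so it is not the Jython script above. TextExtractor, strip_html and the sample string are made-up names for illustration. It also collapses whitespace so the result stays on one line, which may help when the value ends up in a CSV file.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored
        self.parts.append(data)


def strip_html(value):
    """Return value with tags removed and whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(value)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up sample standing in for a 'description' field
sample = "<p>Hello <b>world</b>,<br>\nsecond &amp; last line</p>"
print(strip_html(sample))
```

Note that in Python 2.7, which is what Jython provides, the same parser class lives in the HTMLParser module instead, so this sketch would need adjusting before being used inside the evaluator.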

Cheers, Dash

awesome man appreciate the help!

Somniac ( 2018-12-12 )

You're welcome!

iamontheinet ( 2018-12-12 )
