Hi!

Your best bet for removing all HTML tags from a given input field is to use the Jython Evaluator along with Beautiful Soup.

Here's how:

Step 1: Install Beautiful Soup for Python 2.7 by running pip2 install beautifulsoup4 on the machine where the pipeline runs. If all goes well, it is installed in the default location, /usr/local/lib/python2.7/site-packages/ (this path can differ between systems).
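The install location can vary from system to system, so it may be worth confirming it before hard-coding the path in Step 2. Here is a minimal sketch, assuming you run it with the same Python 2.7 interpreter that pip2 installed into; install_dir is a hypothetical helper name, not part of any library:

```python
import importlib
import os


def install_dir(module_name):
    """Return the directory that contains the given importable package."""
    module = importlib.import_module(module_name)
    # For a package, __file__ points at <dir>/<package>/__init__.py,
    # so going up two levels gives the directory to add to sys.path.
    package_dir = os.path.dirname(os.path.abspath(module.__file__))
    return os.path.dirname(package_dir)
```

Calling install_dir('bs4') (bs4 is the import name of beautifulsoup4) should print the directory to pass to sys.path.append in the script below.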

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

import sys
sys.path.append('/usr/local/lib/python2.7/site-packages/')
from bs4 import BeautifulSoup

for record in records:
  try:
    # Create HTML object for record.value['description']
    desc_html_object = BeautifulSoup(record.value['description'], 'html.parser')

    # Strip all HTML tags and store the text in a new field 'text'
    record.value['text'] = desc_html_object.get_text()

    # Write record
    output.write(record)

  except Exception as e:
    # Send record to error
    error.write(record, str(e))

Running the pipeline adds a new text field to each record, holding the contents of description with the HTML tags stripped.
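If installing Beautiful Soup on that machine isn't an option, a rougher fallback is possible with the standard library alone. To be clear, this is a swapped-in technique and not the Beautiful Soup approach above: TagStripper and strip_tags are hypothetical names, and on Python 2.7 / Jython this sketch drops HTML entities such as &amp; instead of decoding them.

```python
# Stand-in for Beautiful Soup using only the standard library (assumption:
# plain tag removal is enough; no entity decoding on Python 2.7).
try:
    from HTMLParser import HTMLParser   # Python 2.7 / Jython
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TagStripper(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)
```

In the evaluator script, record.value['text'] = strip_tags(record.value['description']) would then take the place of the two Beautiful Soup lines.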

Cheers, Dash
