Hi!

Your best bet for accessing, parsing, or removing HTML tags from a given input source is to use the Jython Evaluator along with Beautiful Soup.

Here are the steps to get you started:

Step 1: Install Beautiful Soup for Python 2.7 by running pip2 install beautifulsoup4 on the machine where SDC runs. If all goes well, it will be installed in a location similar to /usr/local/lib/python2.7/site-packages/.
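
If the install location on your machine differs from the one above, a small check like the following can print it. This is only a sketch, not part of the original steps: run it with the same Python interpreter that pip2 installed into, and use whatever path it prints in the sys.path.append() call in Step 2. The variable names are illustrative.

```python
import os
import bs4

# bs4.__file__ points inside the installed package directory (.../bs4/),
# so the directory one level above it is the one to add to sys.path.
package_dir = os.path.dirname(os.path.abspath(bs4.__file__))
site_packages_dir = os.path.dirname(package_dir)

print(site_packages_dir)
```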

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

import sys
sys.path.append('/usr/local/lib/python2.7/site-packages/')

from bs4 import BeautifulSoup

for record in records:
  try:
    # Create a BeautifulSoup object for the record field 'description', which contains HTML markup
    html_object = BeautifulSoup(record.value['description'], 'html.parser')

    # Strip all HTML tags and store the remaining text in a new field 'text'
    record.value['text'] = html_object.get_text()

    # Write record
    output.write(record)
  except Exception as e:
    # Send record to error
    error.write(record, str(e))

As a result, each record gets a new field, 'text', containing the contents of 'description' with all HTML tags removed.

Beautiful Soup can do a lot more than strip tags, including parsing and searching the HTML. For details, refer to its documentation.
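
To get a feel for it before wiring anything into a pipeline, you can try Beautiful Soup in a plain Python session. The snippet below is a minimal sketch; the sample HTML and variable names are made up for illustration. It shows get_text() with and without a separator, and find_all() for pulling out link targets.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, standing in for a 'description' field
sample = '<p>Hello <b>world</b></p><p>See <a href="https://example.com">this link</a></p>'
soup = BeautifulSoup(sample, 'html.parser')

# Default get_text() joins the text pieces with nothing in between,
# so text from adjacent tags runs together.
plain = soup.get_text()

# A separator plus strip=True trims each piece and joins them with spaces.
spaced = soup.get_text(separator=' ', strip=True)

# find_all() returns every matching tag; here we collect the href of each link.
links = [a.get('href') for a in soup.find_all('a')]

print(plain)
print(spaced)
print(links)
```

In the Jython Evaluator script, the same calls can be applied to html_object, for example to store the list of links in another record field.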

Hope this helps.

Cheers, Dash