Using StreamSets for HTML parsing

asked 2019-06-19 08:14:10 -0600

DataAnalyst1029

While this is not the main function of my pipeline, I would like to add an HTTP origin that fetches a website so I can parse its data and include it in my calculations/output. Is there a specific processor I should use to keep the website's element hierarchy intact? There is also a <script> tag with various functions containing information I would like to use.

My question is: would StreamSets be the best tool for scraping data from websites, or should I use a third-party tool first and then ingest its output with my StreamSets pipeline?

1 Answer

answered 2019-06-19 13:17:46 -0600

iamontheinet

Your best bet for accessing, parsing, or removing HTML tags from a given input source is to use the Jython Evaluator along with Beautiful Soup.

Here are the steps to get you started:

Step 1: Install Beautiful Soup for Python 2.7 by executing pip2 install beautifulsoup4 on your local/SDC machine. If all goes well, it will get installed in a location similar to /usr/local/lib/python2.7/site-packages/.

Step 2: Add a Jython Evaluator to your pipeline and paste the following code into its Script section:

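import sys

# Assumption: make the install location from Step 1 visible to Jython; adjust the path to match your machine
sys.path.append('/usr/local/lib/python2.7/site-packages')
from bs4 import BeautifulSoup

for record in records:
  try:
    # Create a BeautifulSoup object from the 'description' field, which contains HTML markup
    html_object = BeautifulSoup(record.value['description'], 'html.parser')

    # Strip all HTML tags and store the plain text in a new field 'text'
    record.value['text'] = html_object.get_text()

    # Write record to the processor output
    output.write(record)
  except Exception as e:
    # Send record to error
    error.write(record, str(e))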

The result is a new text field on each record that holds the description content with all HTML tags removed.
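
Step 1 notes that the install location can vary from machine to machine. One way to find the actual directory is a small check like the one below, run with the same Python 2.7 interpreter that pip2 installed into. This is only a sketch: the helper name find_package_dir is made up for illustration, and it assumes the package can be imported by that interpreter.

```python
from __future__ import print_function

import importlib
import os


def find_package_dir(name):
    """Return the directory that contains package `name`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # For a package, __file__ is <dir>/<name>/__init__.py, so go up two levels
    return os.path.dirname(os.path.dirname(os.path.abspath(module.__file__)))


# Prints the site-packages directory that holds bs4, or None if bs4 is not installed
print(find_package_dir('bs4'))
```

If this prints a directory, that is the location the Jython script needs to be able to see when it imports bs4.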

Beautiful Soup can do much more than strip tags, including navigating the parsed element tree and pulling out individual tags. For details, refer to the Beautiful Soup documentation.
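
As a rough illustration of the two things asked about in the question (keeping the element hierarchy and reading a <script> tag), here is a minimal standalone sketch. The sample HTML and the helper names element_paths and script_sources are hypothetical; in a pipeline the markup would come from a record field instead. Note that Beautiful Soup only returns the script source as text. It does not execute or parse the JavaScript, so pulling values out of the functions would need extra string handling.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in a pipeline this would come from a record field
SAMPLE_HTML = """
<html>
  <head>
    <title>Prices</title>
    <script>function rate() { return 1.25; }</script>
  </head>
  <body>
    <div id="main"><p>Widget</p><p>Gadget</p></div>
  </body>
</html>
"""


def element_paths(soup):
    """Return one 'ancestor > ... > tag' path per tag, in document order."""
    paths = []
    for tag in soup.find_all(True):
        # tag.parents walks upward and ends at the BeautifulSoup object,
        # whose name is '[document]', so that entry is skipped
        path = [tag.name] + [p.name for p in tag.parents if p.name != '[document]']
        paths.append(' > '.join(reversed(path)))
    return paths


def script_sources(soup):
    """Return the raw text of each <script> tag that has inline content."""
    return [script.string for script in soup.find_all('script') if script.string]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
for path in element_paths(soup):
    print(path)
print(script_sources(soup))
```

The same functions could be called from the Jython Evaluator script on a record field, with the results written to new fields on the record.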

Hope this helps.

Cheers, Dash


Last updated: Jun 19 '19