Using StreamSets for HTML parsing

While this is not the main function of my pipeline, I would like to add an HTTP origin that fetches a website's data so I can parse it and include it in my calculations/output. Is there a specific processor I should use to keep the page's element hierarchy intact? The page also contains a <script> tag with various functions that hold information I would like to use.

My question is: would StreamSets be the best tool for scraping data from websites, or should I first scrape the site with a third-party tool and then ingest its output into my StreamSets pipeline?
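
To make the third-party option concrete, here is a minimal sketch of what such a pre-processing step might look like, assuming it is written in Python with only the standard library. It is not a StreamSets API; the names `TreeBuilder`, `parse_html`, `find_all`, and `to_record` are hypothetical helpers made up for illustration. It keeps the element hierarchy as a nested structure, collects the raw text of every <script> tag, and serializes the result as one JSON record that a pipeline could presumably read through a JSON-capable origin.

```python
import json
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds a nested dict mirroring the element hierarchy."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"tag": "#document", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]  # path from the root to the currently open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text (including raw <script> content) is attached to the open element.
        self.stack[-1]["text"] += data


def parse_html(html):
    """Parse an HTML string into a nested dict tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every node named `tag` in document order."""
    found = [node] if node["tag"] == tag else []
    for child in node["children"]:
        found.extend(find_all(child, tag))
    return found


def to_record(html):
    """Serialize the tree plus the raw script bodies as one JSON record."""
    tree = parse_html(html)
    scripts = [s["text"] for s in find_all(tree, "script")]
    return json.dumps({"tree": tree, "scripts": scripts})
```

The script bodies come out as plain strings, so pulling values out of the functions inside them would still need a separate step (for example a regular expression or a JavaScript parser), whichever tool does the scraping.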