
I would recommend starting by working through the tutorial - it will help you understand how the pieces fit together, and it shows how to write data to Hadoop. The key principle is that origins (such as HTTP Client) parse the incoming data into in-memory records; processors act on those records; and destinations format the records and write them to the data store. Your pipeline will probably use the HTTP Client origin set to the JSON data format and the Hadoop FS destination set to the Delimited data format. As I mentioned before, you'll need to flatten any hierarchical structure in your incoming records, since delimited output is row-oriented. This blog post is a good guide to transforming data in the pipeline: Transform Data in StreamSets Data Collector.
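To make the flattening step concrete, here is a minimal standalone Python sketch (not Data Collector code - inside a pipeline you would use a processor such as a flattener or a scripting evaluator, as the blog post describes). The sample record and field names are made up for illustration; it shows the general idea of collapsing nested JSON into dotted keys so the record can be written as a delimited row:

```python
import csv
import io
import json

def flatten(record, prefix=""):
    """Recursively flatten nested dicts, e.g. {"user": {"name": "Pat"}} -> {"user.name": "Pat"}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical hierarchical record, as an origin might parse it from JSON
raw = json.loads('{"id": 1, "user": {"name": "Pat", "city": "Oslo"}}')
flat = flatten(raw)
# flat == {"id": 1, "user.name": "Pat", "user.city": "Oslo"}

# Write the flattened record as one delimited (CSV) row with a header,
# which is the shape a delimited destination ultimately produces
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(flat))
writer.writeheader()
writer.writerow(flat)
print(buf.getvalue())
```

Once every record is flat like this, the Delimited data format on the destination can map each field to a column.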