Ask Your Question

How can I load large volumes of data to Redshift?

asked 2017-05-17 19:52:02 -0500

Tuple gravatar image

I would like to load a lot of data (millions of rows) to Redshift but inserts using the Redshift JDBC driver are very slow, taking many seconds per batch and sometimes causing locking issues.

How can I reliably load data to Redshift using StreamSets in a high-performance manner?

edit retag flag offensive close merge delete

1 Answer

Sort by ยป oldest newest most voted

answered 2017-05-17 20:08:55 -0500

Tuple gravatar image

updated 2017-05-18 23:59:16 -0500

Amazon Redshift is not designed to be a direct recipient of streaming data ingest and will have significant performance issues with a large number of small commits. You can INSERT and UPDATE data to Redshift using the Redshift JDBC driver, but doing a large amount of small commits to a Redshift table will take a very long time and will fail/block a lot.

However, there are a couple of alternative high-performance ways to load data into Redshift using StreamSets.

Option #1

Amazon's preferred method for loading data to Redshift is using the COPY command. This requires that you place a file in S3 and then run a SQL command to tell Redshift a bunch of parameters about what table to load the data to. Not ANSI SQL, it's a Redshift-specific thing. More info here on the COPY command:

What you could do is write a pipeline to place files in an S3 bucket, and then use an Event chained off of the S3 destination to trigger a COPY command via a JDBC Query executor. You can even parameterize the COPY command to the file that has been loaded by passing the values generated by the Event on the S3 destination. You can also pass in functions into a SQL script using the normal EL function notation, something like this:

INSERT INTO table1 (
user_nm, etl_load_id) SELECT stg.user_nm ,${record:value("/etlload_id")} FROM
stagetable stg

Option #2

If you really need to stream data into Redshift and don't want to have to trigger your own COPY command, then Amazon recommends that you stream to Kinesis Firehose. StreamSets has a Kinesis destination that you can write to.

Firehose internally buffers, writes files out to S3 buckets (wherever you specify) and then periodically runs COPY commands for you to load into Redshift, usually every 5 minutes or so.

One drawback of this is if your Redshift cluster is marked as "Not Publicly Accessible", then Kinesis cannot currently connect to it. More detail here, look halfway down under "VPC Access to an Amazon Redshift Cluster":

In my current environment we have a lot of sensitive data in the Redshift cluster and it's not acceptable to make the cluster Publicly Accessible. So we can't use Kinesis, we have to use option #1 above and handle loading ourselves.

Both options have more pros and cons, but this is a quick overview.

edit flag offensive delete link more
Login/Signup to Answer

Question Tools



Asked: 2017-05-17 19:52:02 -0500

Seen: 2,918 times

Last updated: May 18 '17