
I am trying to insert Spark Streaming data directly into an Amazon Redshift cluster, but I am not able to find the right way to do it.

Below is the code I have, but it first writes the data to S3 and then copies it into Redshift:

# REDSHIFT_JDBC_URL = "jdbc:redshift://%s:5439/%s" % (REDSHIFT_SERVER, DATABASE)

# Write the DataFrame to Redshift; the connector stages the rows in the
# S3 "tempdir" location and then loads them into the target table.
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", REDSHIFT_JDBC_URL) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", "s3n://%s:%s@%s" % (ACCESS_KEY, SECRET, S3_BUCKET_PATH)) \
    .mode("overwrite") \
    .save()

Does going through S3 impact streaming or insertion performance?

Or is there any other way to do it?

What is the source of your streaming data? Are you using Spark? - Jon Scott
The source is Kafka. The use case is to get the data from the source, process it, and insert it into Redshift. - user3774905

1 Answer


The Redshift data source for Spark that you are using writes to Redshift by first writing the data to temporary files on S3 and then issuing a Redshift COPY command to load those files into the table. The COPY command requires the data to be in files on S3, and it is the most efficient way to bulk-load data into Redshift, so what your code is currently doing appears to be correct.
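For illustration only, here is roughly what the connector does behind the scenes after staging the files. This is a hedged sketch, not the connector's actual code: psycopg2, DB_USER, DB_PASSWORD, and the Avro format are my assumptions for the example; the connector generates and runs its own COPY statement for you.

# Rough sketch of the bulk load the connector performs, assuming staged
# files under S3_BUCKET_PATH. psycopg2 and the credential names below are
# assumptions for illustration; any Postgres-compatible driver would work.
import psycopg2

conn = psycopg2.connect(host=REDSHIFT_SERVER, port=5439, dbname=DATABASE,
                        user=DB_USER, password=DB_PASSWORD)
with conn.cursor() as cur:
    # Load all staged files from the temp location in one bulk operation.
    cur.execute(
        "COPY %s FROM 's3://%s/' "
        "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
        "FORMAT AS AVRO 'auto'"
        % (TABLE_NAME, S3_BUCKET_PATH, ACCESS_KEY, SECRET)
    )
conn.commit()
conn.close()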

See also: https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
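
One caveat for your streaming use case: df.write only works on a static DataFrame, so from Structured Streaming you would typically wrap the same writer in foreachBatch (available in Spark 2.4+). Below is a minimal sketch under that assumption, reusing the placeholder names from your question; KAFKA_BROKERS and KAFKA_TOPIC are additional placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-redshift").getOrCreate()

# Read the Kafka topic as a streaming DataFrame.
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKERS) \
    .option("subscribe", KAFKA_TOPIC) \
    .load()

def write_batch(batch_df, batch_id):
    # Each micro-batch is an ordinary DataFrame, so the writer from the
    # question works unchanged; "append" avoids rewriting the whole table
    # on every batch.
    batch_df.write \
        .format("com.databricks.spark.redshift") \
        .option("url", REDSHIFT_JDBC_URL) \
        .option("dbtable", TABLE_NAME) \
        .option("tempdir", "s3n://%s:%s@%s" % (ACCESS_KEY, SECRET, S3_BUCKET_PATH)) \
        .mode("append") \
        .save()

# foreachBatch hands each micro-batch to the function above.
query = stream_df.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()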