1 vote

I'm currently working with a small company that stores all of its app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.

The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?

The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?

I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there are so many that I'm a little overwhelmed. Thanks in advance for the assistance!

Please display your code. - SphynxTech
This is a really broad question and there are many ways to implement what you're talking about. You are generally asking about ETL (Extract, Transform, Load), so I would advise searching books and docs on that. - Dan Kowalczyk
Also, since you're new to SO, you may not know that you're more likely to get answers if you keep your questions focused. In my experience a focused question tends to get a useful general answer far more often than a general question gets a focused one. - Dan Kowalczyk
Please accept my answer, or whichever is the best answer below. - Jon Scott
Stored Procedures are now supported in Amazon Redshift from version 1.0.7287 (late April 2019). Please review the document "Creating Stored Procedures in Amazon Redshift" for more information on getting started with stored procedures. - Joe Harris

3 Answers

2 votes

ETL

Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)

You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:

`psql <authentication stuff> -c 'insert into z select a, b from x'`

(Use psql v8, upon which Redshift was based.)
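If you would rather drive that from Python than from a shell script, a PostgreSQL driver such as psycopg2 works the same way. Here is a minimal sketch, using the same example tables as above and placeholder connection details, that a nightly cron entry (e.g. `0 2 * * * python /opt/etl/nightly_etl.py`) could run:

```python
# nightly_etl.py -- run an INSERT ... SELECT against Redshift, e.g. from cron.
# Host, database and credentials below are placeholders; use your own.
import os
import psycopg2

TRANSFORM_SQL = "INSERT INTO z SELECT a, b FROM x;"

def main():
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],          # e.g. my-cluster....redshift.amazonaws.com
        port=5439,
        dbname="mydb",
        user="etl_user",
        password=os.environ["REDSHIFT_PASSWORD"],  # keep secrets out of the script
    )
    try:
        with conn.cursor() as cur:
            cur.execute(TRANSFORM_SQL)
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```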

Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.

Machine Learning

Yes, you could run code on an EC2 instance. If the job is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.

Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
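As a rough illustration of that pattern, here is a hedged sketch of the Lambda handler side; the AMI, instance profile and user-data script are placeholders, and the instance terminates itself when its work finishes because shutdown behaviour is set to `terminate`:

```python
# Lambda handler triggered by a CloudWatch Events schedule.
# It launches a worker instance that runs a processing script and then
# shuts itself down; with 'terminate' behaviour, shutdown == terminate.
import boto3

USER_DATA = """#!/bin/bash
python3 /opt/ml/score_records.py   # placeholder for your processing job
shutdown -h now
"""

def lambda_handler(event, context):
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                   # placeholder AMI with your code baked in
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        InstanceInitiatedShutdownBehavior="terminate",
        IamInstanceProfile={"Name": "ml-worker-profile"},  # placeholder instance profile
        UserData=USER_DATA,                                # boto3 base64-encodes this for you
    )
    return {"status": "worker launched"}
```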

Lots of options, indeed!

1 vote

The two options for running ETL on Redshift:

  1. Create some "create table as" type SQL, which takes your source tables as input and generates your target (transformed) table (see the sketch after this list)
  2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
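
For option 1, the transformation can often be expressed as a drop-and-rebuild with CREATE TABLE AS. A minimal sketch with hypothetical table and column names, run here through psycopg2 (psql works just as well):

```python
# Rebuild a derived table inside Redshift with CREATE TABLE AS (option 1).
# Table/column names are hypothetical; connection details are placeholders.
import os
import psycopg2

CTAS_SQL = """
DROP TABLE IF EXISTS analytics.daily_orders;
CREATE TABLE analytics.daily_orders AS
SELECT order_date, customer_id, SUM(amount) AS total_amount
FROM public.orders
GROUP BY order_date, customer_id;
"""

conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname="mydb",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)
with conn:                       # commits on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute(CTAS_SQL)
conn.close()
```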

Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).

Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.

There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.
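
If you go the Airflow route, here is a minimal sketch of a nightly DAG (Airflow 1.x style imports), assuming an Airflow connection named `redshift` has been set up for the cluster and reusing the hypothetical tables above:

```python
# dags/redshift_nightly_etl.py -- run a nightly SQL transformation on Redshift.
# Assumes an Airflow connection called "redshift" pointing at your cluster.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

default_args = {"owner": "etl", "start_date": datetime(2019, 1, 1)}

dag = DAG(
    dag_id="redshift_nightly_etl",
    default_args=default_args,
    schedule_interval="0 2 * * *",   # every night at 02:00
    catchup=False,
)

rebuild_daily_orders = PostgresOperator(
    task_id="rebuild_daily_orders",
    postgres_conn_id="redshift",
    sql="""
        DROP TABLE IF EXISTS analytics.daily_orders;
        CREATE TABLE analytics.daily_orders AS
        SELECT order_date, customer_id, SUM(amount) AS total_amount
        FROM public.orders
        GROUP BY order_date, customer_id;
    """,
    dag=dag,
)
```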

0 votes

Basic transforming of existing data

Since you seem to be a Python developer (you mentioned you are developing a Python-based ML model), you can do the transformation by following the steps below:

  1. You can use boto3 (https://aws.amazon.com/sdk-for-python/) to manage the cluster and a PostgreSQL driver such as psycopg2 to run SQL against Redshift from any workstation on your LAN (make sure your IP has the proper privileges)
  2. You can write your own Python functions that mimic stored procedures; inside these functions you can put your transformation logic (see the sketch after this list)
  3. Alternatively, you can create functions in Python inside Redshift itself (Python UDFs) that act somewhat like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/)
  4. Finally, you can use Windows Task Scheduler / a cron job to schedule your Python scripts with parameters, the way SQL Server Agent jobs do
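
A minimal sketch of point 2, with hypothetical table names and placeholder connection details: a parameterized Python function that plays the role of a stored procedure and can be scheduled with cron or Task Scheduler:

```python
# daily_summary.py -- a parameterized "stored procedure"-style function,
# run from outside Redshift. Table names and connection details are placeholders.
import os
import sys
import psycopg2

def load_daily_summary(run_date):
    """Transform one day's rows, the way a parameterized sproc would."""
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname="mydb",
        user="etl_user",
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    sql = """
        INSERT INTO analytics.daily_summary (order_date, customer_id, total_amount)
        SELECT order_date, customer_id, SUM(amount)
        FROM public.orders
        WHERE order_date = %s
        GROUP BY order_date, customer_id;
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(sql, (run_date,))
    conn.close()

if __name__ == "__main__":
    load_daily_summary(sys.argv[1])   # e.g. python daily_summary.py 2019-05-01
```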

Best way to host my python logic

It seems to me you are reading some data from Redshift, then creating test and training sets, and finally producing some predicted results (records). If so:

  1. Host the script on any of your own servers (LAN) and connect to Redshift with a PostgreSQL driver such as psycopg2. If a large number of rows has to be transferred over the internet, an EC2 instance in the same region is a better option. Start the instance on an ad-hoc basis, complete your job, and stop it again; that keeps it cost effective. You can automate this through the AWS SDK. I have done it with the .NET SDK, and boto3 offers the same support.
  2. If your result sets are relatively small, you can insert them directly into the target Redshift table
  3. If the result sets are larger, save them as CSV files (there are several Python libraries for this), upload them to Amazon S3, and load the rows with the COPY command, into a staging table if you need any intermediate calculation, or directly into the target table if not (see the sketch after this list)
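
A hedged sketch of point 3, assuming pandas, boto3 and psycopg2 are available; the bucket, key, table names and IAM role are placeholders:

```python
# Write scored records to CSV, push the file to S3, then COPY into Redshift.
# Bucket, key, table names and IAM role below are placeholders.
import io
import os

import boto3
import pandas as pd
import psycopg2

def save_scores(scored_df: pd.DataFrame):
    # 1. Dump the scored rows to a CSV object in S3.
    buf = io.StringIO()
    scored_df.to_csv(buf, index=False, header=False)
    boto3.client("s3").put_object(
        Bucket="my-ml-bucket",
        Key="scores/scores.csv",
        Body=buf.getvalue().encode("utf-8"),
    )

    # 2. COPY from S3 into a staging table, then insert into the target table.
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"], port=5439, dbname="mydb",
        user="etl_user", password=os.environ["REDSHIFT_PASSWORD"],
    )
    copy_sql = """
        COPY staging.model_scores
        FROM 's3://my-ml-bucket/scores/scores.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV;
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)
            cur.execute(
                "INSERT INTO analytics.model_scores "
                "SELECT * FROM staging.model_scores;"
            )
    conn.close()
```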

Hope this helps.