3
votes

We are in the process of migrating our data warehouse from Oracle to Redshift. Currently we have two Oracle database instances: a primary DW instance that gets data loaded from different sources throughout the day, and a secondary DW instance that replicates the data from the primary. All reporting platforms point to the secondary DW instance. How can we address this in Redshift? Do we need two Redshift instances, one replicating from the other? If we have just one Redshift instance, will the data load overhead affect query performance? Will there be table lock issues?

Appreciate your suggestions. Thanks.


2 Answers

0
votes

It really depends on how quickly your reporting platforms need access to the data that is loaded throughout the day. If it can wait, then it makes sense to batch load during quiet hours. I suspect, given that you're using replication in your current setup, that you need the data to be loaded and available as soon as possible.

In that case, it would make sense to utilise Redshift's Workload Management (WLM) settings. WLM lets you define multiple query queues and assign each one a concurrency level and a share of the cluster's memory. Using this model, you can ring-fence resources so that queries from your reporting tools and end users are guaranteed a consistent allocation, while still dedicating a portion of the cluster's queues and resources to your data loads.

This would also eliminate the need to run two separate database instances, one for loading and one for serving data.

See here for more detail on WLM in Redshift: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
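As a rough illustration, here is a minimal sketch in Python of applying such a WLM setup with boto3. The parameter group name, queue layout, and numbers are all hypothetical; only the wlm_json_configuration parameter and its query_group / query_concurrency / memory_percent_to_use keys come from Redshift's documented WLM settings.

    import json
    import boto3  # AWS SDK for Python

    # Hypothetical queue layout: one queue for reports, one for loads,
    # plus the mandatory default queue at the end.
    wlm_config = [
        {
            "query_group": ["reports"],   # routed via: SET query_group TO 'reports';
            "query_concurrency": 10,
            "memory_percent_to_use": 70,
        },
        {
            "query_group": ["etl"],       # routed via: SET query_group TO 'etl';
            "query_concurrency": 2,
            "memory_percent_to_use": 20,
        },
        {
            "query_concurrency": 3,       # default queue: everything else
            "memory_percent_to_use": 10,
        },
    ]

    redshift = boto3.client("redshift")
    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-dw-params",  # hypothetical parameter group
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )

Note that wlm_json_configuration is a static parameter, so the cluster needs a reboot before the new queues take effect; your load jobs then opt into their queue by running SET query_group TO 'etl'; at the start of each session.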

0
votes

Never read and write from the same instance, not even in Redshift. In general, any system that forces you to read and write from the same machine reflects a poor design.

Since you are discussing Amazon Redshift, I can quite comfortably assume that you have analytical data. (Redshift, with its columnar architecture, is optimised for reads rather than writes, so if you happen to be storing transactional data on Redshift, I would recommend reconsidering that decision.)

Before designing any infrastructure for analytical data, we should always consider that:

  1. It'll be voluminous, and
  2. it'll need to scale further in the near future.

When you scale, reading and writing from the same machine will be catastrophic. And don't forget the locks: DELETE and TRUNCATE hold exclusive locks on the table. If some other process or user has already acquired a conflicting lock, even the write on that table will fail, messing up your data.

The above reasons should be convincing enough as to why you should not use a single warehouse to both read and write data. (If you are forced to load and read on one cluster anyway, the staging-and-swap pattern sketched below at least keeps the lock window small.)
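Here is that staging-and-swap pattern as a minimal Python sketch, assuming psycopg2 and purely hypothetical table, bucket, and role names: the slow COPY happens in a staging table that no reader touches, and the exclusive lock is only taken for the instant the renames run.

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    # Hypothetical connection details and object names.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dw", user="etl_user", password="...",
    )

    with conn.cursor() as cur:
        # 1. Load into a staging table; readers of "sales" are unaffected
        #    while this slow step runs.
        cur.execute("CREATE TABLE sales_staging (LIKE sales);")
        cur.execute(
            "COPY sales_staging FROM 's3://my-bucket/sales/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy' "
            "FORMAT AS CSV;"
        )
        # 2. Swap the tables. RENAME takes an exclusive lock, but holds it
        #    only for an instant, not for the duration of the load.
        cur.execute("ALTER TABLE sales RENAME TO sales_old;")
        cur.execute("ALTER TABLE sales_staging RENAME TO sales;")
        cur.execute("DROP TABLE sales_old;")
    conn.commit()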

That said, the cleaner approach is the model below, which is neat, keeps loads and reads from ever interfering with each other, and ensures you don't face consistency and locking issues:

 DS 1    +------------+              +------------+
-------->|            |              |            |
 DS 2    |            |  AGGREGATES  |            |    reads
-------->|    DW 1    +------------->|    DW 2    +------------>
  ...    |            |              |            |
 DS n    |            |              |            |
-------->+------------+              +------------+

where DS : Data Source, DW : Data Warehouse

How often you migrate data from DW 1 --> DW 2 will depend entirely on how fresh the data your reports refer to needs to be. A common way to implement that hop is an UNLOAD from DW 1 into S3 followed by a COPY into DW 2, sketched below.
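A minimal Python sketch of the DW 1 --> DW 2 hop, again with hypothetical hosts, bucket, IAM role, and table names; UNLOAD ... FORMAT AS PARQUET and COPY ... FORMAT AS PARQUET are standard Redshift SQL, everything else here is an assumption.

    import psycopg2

    S3_PREFIX = "s3://my-bucket/aggregates/daily_sales_"    # hypothetical
    IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftS3"  # hypothetical

    # 1. On DW 1 (the write/load cluster): export the aggregate to S3.
    with psycopg2.connect(host="dw1.example.com", port=5439,
                          dbname="dw", user="etl", password="...") as dw1:
        with dw1.cursor() as cur:
            cur.execute(f"""
                UNLOAD ('SELECT region, sale_date, SUM(amount) AS total
                         FROM sales GROUP BY region, sale_date')
                TO '{S3_PREFIX}'
                IAM_ROLE '{IAM_ROLE}'
                FORMAT AS PARQUET;
            """)

    # 2. On DW 2 (the read cluster): refresh the table from S3.
    with psycopg2.connect(host="dw2.example.com", port=5439,
                          dbname="dw", user="etl", password="...") as dw2:
        with dw2.cursor() as cur:
            cur.execute("TRUNCATE daily_sales;")
            cur.execute(f"""
                COPY daily_sales
                FROM '{S3_PREFIX}'
                IAM_ROLE '{IAM_ROLE}'
                FORMAT AS PARQUET;
            """)

Scheduling this on a cron or an orchestration tool at whatever cadence your reports can tolerate gives you the same primary/secondary split you have in Oracle, just mediated through S3 instead of replication.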