0
votes

I was studying Amazon Redshift using the Sybex Official Study Guide, at page 173 there are a couple of phrases:

You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.

That leads me to some questions?

1) What is the role of "optimizer"? Do data re-arranged across compute nodes to boost performance for each new query?

2) if 1) is true and new query completly different is performed: What happen to the old data in the compute nodes?

3) Can you explain me better the 3 distribution styles (EVEN, KEY, ALL) particularly the KEY style.

Extra questions:

1) Does the leader node has records?

1

1 Answers

0
votes

To clarify a few things:

  • The Distribution Key is not a hint -- data is actually distributed according to the key
  • When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that data can be joined on a particular node, but the data does not then reside on the destination node
  • The optimizer doesn't actually "do" anything -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node)

The only thing you really need to know about the Optimizer is:

Query optimizer

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.

From Data Warehouse System Architecture:

Leader node

The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.

The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).

For information on Distribution Styles, refer to: Distribution Styles

If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge.