0
votes

I would like to ask what rules does GreenPlum separate it's tables into segments.

Does it evenly separate data into segments or according to block size?

Or depends on other characteristic of the data.

Thanks!

2

2 Answers

1
votes

Two different ways.

  1. Distribution key

Example:

CREATE TABLE foo
(id int, bar text)
DISTRIBUTED BY (id);

This will spread the data the id column. You should pick a column or set of columns that will spread the data evenly across the database. If the table is very large and you join it to another table that is also very large, you may want to distribute both tables by the same keys.

  1. Random

Example:

CREATE TABLE foo
(id int, bar text)
DISTRIBUTED RANDOMLY;

This distributes the data in a random fashion. Use this for small tables or if there isn't a natural key to the table.

You can also see how the distribution by using the hidden column "gp_segment_id".

select gp_segment_id, count(*) from foo group by gp_segment_id order by gp_segment_id;
 gp_segment_id | count 
---------------+-------
             0 |  1654
             1 |  1655
             2 |  1665
             3 |  1661
             4 |  1682
             5 |  1683
0
votes

Your data is evenly distributed if there are enough different values in your distribution key. A primary key or a unique key are good candidates for that.

If your distribution key has only a few distinct values, each row with the same value ends up on the same segment. That's an uneven distribution, some segments have to do all the work and others will be idle.

Example: if you chose gender as distribution key, you (usually) end up with 2-3 values: female/male/unknown. Let's leave other possible values aside here. In this scenario you have 2-3 distinct values, and 2-3 different segment databases will hold all rows. Worst case: these 3 segment databases are all on the same physical host.