Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
- You have a table
order_lines
with 100,000,000 rows
- The table has a column named
customer_id
- You have 1,000,000 distinct customers
- Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id
is skewed. On average, you'd expect each distinct customer_id
to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines
based on a non-unique index on customer_id
and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP
operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.