0
votes
SELECT
  a.id,
  b.url as codingurl 
FROM fact_A a 
INNER JOIN dim_B b
ON strpos(a.url,b.url)> 0
  • Records Count in Fact_A: 2 Million
  • Records Count in Dim_B : 1500
  • Time Taken to Execute : 10 Mins
  • No of Nodes: 2

Could someone help me with an understanding why the above query takes more time to execute?

We have declared the distribution key in Fact_A to appropriately distribute the records evenly in both the nodes and also Sort Key is created on URL in Fact_A.

Dim_B table is created with DISTRIBUTION ALL.

1
can you give more info, specifically what kind of nodes you are using, how many nodes in your cluster and table schema of A and Bketan vijayvargiya
What do you mean by "why the above query takes more time to execute"? More time than what? Or are you just saying that you want it to be faster? Remember -- it is having to do 3 billion string comparisons, and strpos is not an efficient way to join tables compared to normal comparisons (eg =, <, >=).John Rotenstein
Can you provide some sample values for a.url and b.url that are being used with strpos? There might be a more efficient way to do the query.John Rotenstein

1 Answers

0
votes

Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in filter) will result in full table scan, executing strpos 3 billion times.

Depending on which urls are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3 then you can extract "part1/part2/part3" as a separate column both in fact_A and dim_B, and make it the dist and sort keys.

You can also rely on parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see immediate performance improvement of 8-10 times as this kind of query can be executed by each node in parallel (for the most part).