Speed up row scanning for large datasets in Spanner

Question

When I issue a simple query in Spanner that uses the primary index of an interleaved table, and the split holds millions of records, the table scan takes a long time.

For instance, SELECT COUNT(*) FROM foo WHERE foo_key="bar", where foo_key is the primary index of the interleaved table, scans 3,000,000 rows and takes up to 40 seconds to resolve. (Please note that this question is not restricted to a simple COUNT query; it applies to any query where the table scan is the bottleneck.)

I'm thinking of how BigQuery handles this case: it uses multiple processes whose results get merged to speed up requests. What is confusing in Spanner is that the execution plan shows a table scan with a single execution, followed by a distributed union of the rows. That makes me think it could use multiple processes as well.

Is there any way to spread the scan across multiple executions to speed up the table scan?

RedPandaCurios · Accepted Answer · 2019-09-09T13:56:39

Cloud Spanner does not perform well as an analytics database - as you have seen - and so full table scans are not recommended.

Depending on your query, you may be able to limit to a key-range or multiple key-ranges to reduce the number of rows scanned, possibly in conjunction with an index...
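As a rough illustration of the key-range idea, here is a minimal Python sketch that splits one large scan into several smaller range-limited queries. It assumes, hypothetically, that the interleaved table's primary key is (foo_key, child_id) with an integer child_id; that column name and the helper functions are made up for the example and are not from the question. It only builds the SQL text and parameters. Running them (for example concurrently from a client, then summing the counts) is left out, and whether that actually ends up faster depends on your schema and data distribution.

```python
from typing import Dict, List, Tuple


def split_key_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split the half-open range [start, end) into at most `parts` contiguous sub-ranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if end <= start:
        return []
    total = end - start
    parts = min(parts, total)  # never create empty sub-ranges
    base, extra = divmod(total, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` sub-ranges get one additional key each.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def build_range_queries(
    foo_key: str, start: int, end: int, parts: int
) -> List[Tuple[str, Dict[str, object]]]:
    """Build one parameterized query per sub-range.

    `child_id` is a hypothetical second primary-key column of the
    interleaved table; replace it with your real key column.
    """
    sql = (
        "SELECT COUNT(*) FROM foo "
        "WHERE foo_key = @foo_key AND child_id >= @lo AND child_id < @hi"
    )
    return [
        (sql, {"foo_key": foo_key, "lo": lo, "hi": hi})
        for lo, hi in split_key_range(start, end, parts)
    ]


if __name__ == "__main__":
    for query, params in build_range_queries("bar", 0, 3_000_000, 4):
        print(query, params)
```

Each (query, params) pair covers a disjoint slice of the key space, so the per-range counts can simply be added together.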

I am not sure I understand your example query though:

SELECT COUNT(*) FROM foo WHERE foo_key="bar"

(where foo_key is the primary index.)

This should read only a single row, since you are specifying the primary key, and return either 1 or 0 depending on whether the key exists.
