2 votes

I have a 3-node Spanner instance and a single table that contains around 4 billion rows. The DDL looks like this:

CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)

I'd like to setup a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:

DELETE FROM predictions WHERE model_version <> ? 

According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.

# this always throws a "504 Deadline Exceeded" error
from google.cloud.spanner_v1 import param_types

database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": param_types.INT64},
)

My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/

I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:

def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)


database.run_in_transaction(
    delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)

What's weird about this is that the timeout_secs parameter appears to be ignored: I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite the one-hour timeout.

Anyway, I'm not sure what to try next, or whether I'm missing something obvious that would let me run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure whether that factors into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance

3

I also read the code you linked and don't understand why the deadline is not respected, as the code should populate timeout_secs from the keyword parameter dictionary and set the deadline. I will internally follow up on that issue. Thank you for reporting! – yongchul

3 Answers

2 votes

The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction, so it's used to set the deadline after which the transaction will no longer be retried.

We will update the docs for run_in_transaction to explain this better.
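
Roughly, the behavior can be pictured like this (a conceptual sketch only, not the library's actual implementation; TransactionAborted is just a stand-in exception name for this sketch):

import time

class TransactionAborted(Exception):
    """Stand-in for the abort error the real client retries on."""

# Conceptual sketch -- timeout_secs bounds how long aborted transactions
# keep being retried; it does not extend the deadline of any individual RPC.
def run_in_transaction_sketch(work, timeout_secs=120):
    deadline = time.time() + timeout_secs
    while True:
        try:
            return work()  # each attempt's RPCs still use their own deadlines
        except TransactionAborted:
            if time.time() >= deadline:
                raise  # give up retrying once timeout_secs has elapsed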

2 votes

The root cause was that the total timeout for streaming RPC calls was set too low in the client libraries: it was 120s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).

This has been fixed in the client library source code, raising it to a 60-minute timeout (which is the maximum), and will be part of the next client library release.

As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect to your database. (I do not know how to set custom timeouts in Python, sorry.)

final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();

SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");

// Apply the 60-minute retry settings to all unary calls...
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(UnaryCallSettings.Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });

// ...and to the streaming calls (ExecuteStreamingSQL and StreamingRead).
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);

Spanner spanner = builder.build().getService();

1 vote

The first suggestion is to try gcloud instead.

https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml

Another suggestion is to also pass a range of name so that you limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each statement touches a small number of rows, but first you will need to know the domain of the name column values.
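
A rough sketch of that approach (the prefixes and version value here are placeholders; you would need to enumerate prefixes that actually cover your name key space):

from google.cloud.spanner_v1 import param_types

# Placeholder prefixes and version -- adapt to your actual data.
prefixes = ["a", "b", "c"]
current_version = 104

for prefix in prefixes:
    # Each statement scans only the slice of the key space under one prefix.
    database.execute_partitioned_dml(
        "DELETE FROM predictions "
        "WHERE STARTS_WITH(name, @prefix) AND model_version <> @model_version",
        params={"prefix": prefix, "model_version": current_version},
        param_types={
            "prefix": param_types.STRING,
            "model_version": param_types.INT64,
        },
    )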

The last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate. Since you only have a handful of distinct model_version values, you could delete by equality instead, as sketched below.
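
For example, a sketch assuming you know the stale versions up front (the version numbers here are placeholders):

from google.cloud.spanner_v1 import param_types

# Placeholder values -- replace with the model_version values you want to drop.
stale_versions = [102, 103]

database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version IN UNNEST(@stale_versions)",
    params={"stale_versions": stale_versions},
    param_types={"stale_versions": param_types.Array(param_types.INT64)},
)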