
I am using a development instance of Google Cloud Bigtable with the Python client (the google-cloud-happybase package).

For development purposes:

- My table has 56.5k rows with 18 columns.
- My table has 1 column family.
- The average content size of each row element is 9.5 bytes.
- Row keys are ~35 bytes on average.
- The row keys are balanced.

When I use the scan() function on my table, I get a generator that yields the contents of each row. However, the time taken to read from the generator is inconsistent. For example:

    import timeit

    append_list = []
    samp = table.scan(columns=['sample_family:ContactId'])
    for _ in range(56547):
        start_time = timeit.default_timer()
        next(samp)  # pull the next row from the scan generator
        elapsed = timeit.default_timer() - start_time
        append_list.append(elapsed)
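The summary statistics below were derived from the collected timings; a minimal sketch of that computation (the statistics module requires Python 3.4+):

    import statistics

    print('median next() time:', statistics.median(append_list))
    print('max next() time:', max(append_list))
    print('total scan time:', sum(append_list))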

- The median time to call next() is 4.05e-06 seconds.

- The max time to call next() is 0.404 seconds, with several calls taking at least 0.1 seconds.

- The total time to call next() on all elements of the generator is 2.173 seconds. Because of the outliers, this is far above the (4.05e-06) * 56,547 ≈ 0.229 seconds it would take if every call ran near the median.

Obviously there are several outliers that throw off the performance.

My question is: why am I seeing this type of performance? It doesn't align with the metrics found here: https://cloud.google.com/bigtable/docs/performance

My thoughts are that since the workload is significantly less than 300 GB, Bigtable might not be able to balance the data in a way that optimizes performance for smaller data sets the way it does for larger ones.

Also, even though my development instance is using 1 node and the table is only 17.1 MB, I feel this should not be an issue.

I was wondering if anyone could give me insight into the issues encountered and possible steps to remedy the situation.


1 Answer


Cloud Bigtable's Read API is a streaming API. Each response in the stream contains a set of rows. Sometimes you need to wait for the next response, but most of the time you get rows that are already in memory. Here are some additional things to consider:

  • The first response will always be slow, because the server side has to batch up a response.

  • The API reads rows sequentially. You can gain performance by parallelizing the requests (see the sketch after this list). In Java, I would get the table's regions to figure out what start/stop keys should be used for the set of scans. Unfortunately, Table.region() is not currently available in Python, so I raised a bug to fix that.

  • FYI, I am the author of the Cloud Bigtable Java client. I added some performance optimizations there to prefetch additional responses. We need to compare the Python client's speed to the Java client's. If you are comfortable with Java, please consider running this same test with that client to compare performance.
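A minimal sketch of the parallelization idea in Python, assuming hand-picked split keys (since Table.region() is unavailable, the split points below are illustrative guesses rather than real region boundaries) and that the client tolerates concurrent scans on one table; if it doesn't, give each worker its own connection:

    from concurrent.futures import ThreadPoolExecutor

    def scan_range(row_start, row_stop):
        # Each worker runs its own scan over a sub-range of the key space.
        return list(table.scan(row_start=row_start,
                               row_stop=row_stop,
                               columns=['sample_family:ContactId']))

    # Hypothetical split points; with balanced row keys, evenly spaced
    # splits are a reasonable first guess.
    ranges = [(None, 'g'), ('g', 'n'), ('n', 't'), ('t', None)]

    with ThreadPoolExecutor(max_workers=len(ranges)) as executor:
        chunks = executor.map(lambda r: scan_range(*r), ranges)

    rows = [row for chunk in chunks for row in chunk]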

I hope this helps.