1
votes

I am developing an IoT data pipeline using Python and Bigtable, and writes are desperately slow.

I have tried both Python client libraries offered by Google. The native API implements a Row class with a commit method. Committing rows one at a time that way from my local development machine, the write performance against a production instance with 3 nodes is roughly 15 writes / 70 KB per second (granted, the writes all hit a single node because of the way my test data is batched, and the data is uploaded over a local network connection). However, Google advertises 10,000 writes per second per node, and the upload speed from my machine is 30 MB/s, so the gap clearly lies elsewhere.
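For reference, the per-row commit pattern described above looks roughly like this. This is a minimal sketch using the google-cloud-bigtable client; the project, instance, table, and column names are placeholders:

```python
# Sketch of per-row commits with the native google-cloud-bigtable client.
# Project, instance, table, and column names here are placeholders.
def write_rows_iteratively(rows_data):
    from google.cloud import bigtable  # assumes google-cloud-bigtable is installed

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-data")

    for row_key, value in rows_data:
        row = table.direct_row(row_key)
        row.set_cell("cf1", "reading", value)
        row.commit()  # one network round trip per row -- this is the bottleneck
```

Each `commit()` is a separate RPC, so throughput is dominated by round-trip latency rather than by Bigtable's per-node write capacity.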

I subsequently tried the happybase API with high hopes, since its interface provides a Batch class for inserting data. However, after disappointingly hitting the same performance ceiling, I realized that the happybase API is just a wrapper around the native API, and its Batch class simply commits rows iteratively, in very much the same way as my original implementation.
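For context, the Batch interface in question looks roughly like this. This is a minimal sketch assuming the google-cloud-happybase package; the column family is a placeholder and connection setup is elided:

```python
# Sketch of the happybase-style Batch interface.
# `table` is assumed to be a google.cloud.happybase Table; connection setup elided.
def write_rows_with_batch(table, rows_data):
    with table.batch() as batch:
        for row_key, value in rows_data:
            # Despite the batching interface, each put is still committed
            # as an individual row mutation under the hood, so the number
            # of network round trips is not reduced.
            batch.put(row_key, {b"cf1:reading": value})
```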

What am I missing?

You're really not missing anything. There is work underway to support Cloud Bigtable's bulk mutation API in the Python client: github.com/GoogleCloudPlatform/google-cloud-python/issues/2411. The other advice I can give you is to do as much work in parallel as possible. Multiple threads/processes will let you scale linearly for quite a while given the performance you're seeing so far. - Gary Elliott
@GaryElliott thank you for the reassurance and guidance! I've implemented a thread pool and I do get linear improvements, but it tapers off at ~15 threads, yielding 200 writes/second. Beyond that there is no improvement. Is that what you would expect, and if so, why is there still such a gap from the purported performance? - JD Margulici
No, that load wouldn't tax Bigtable much; I would start to suspect a client-side/application bottleneck at this point (locking?). Unless all your writes are going to the same row, in which case the writes would be serialized in Bigtable. - Gary Elliott
Are you directly writing to Bigtable, or using something like OpenTSDB to abstract time series for you? If the former, please read the time-series schema design docs and post more information about your schema; you could be using a row key schema that puts all writes on a single node. Also, if you are doing a large batch load into an empty table, you should pre-split your table for faster performance; Bigtable will take over from there, but it needs initial splits to distribute key ranges. - Misha Brukman
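The hot-spotting concern in the last comment can be illustrated with a hypothetical row-key sketch. A key that starts with a timestamp sends all concurrent writes to the same key range (and therefore the same node), while promoting a high-cardinality field such as a device ID to the front of the key spreads writes across nodes. The key formats and device names below are illustrative assumptions, not from the original post:

```python
# Hypothetical row-key construction for time-series data in Bigtable.
# A timestamp-first key concentrates sequential writes on one node;
# a device-first key distributes them across the key space.

def hotspotting_key(device_id, epoch_seconds):
    # Anti-pattern: all writes at a given moment share the same key prefix,
    # so they land in the same key range.
    return f"{epoch_seconds}#{device_id}"

def distributed_key(device_id, epoch_seconds):
    # Better: high-cardinality field first, so concurrent writes from
    # different devices fall in different parts of the key space.
    return f"{device_id}#{epoch_seconds}"

keys = [distributed_key(d, 1500000000) for d in ("sensor-a", "sensor-b")]
```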

1 Answer

2
votes

I know I'm late to this question, but for anyone else who comes across it: the Google Cloud client libraries for Python now support bulk writes via mutations_batcher. Link to the documentation.

You can use batcher.mutate_rows and then batcher.flush to send all rows to be updated in one network call, avoiding the iterative row commits.
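A minimal sketch of that pattern, assuming the google-cloud-bigtable client; project, instance, table, and column names are placeholders:

```python
# Sketch of bulk writes with a MutationsBatcher from google-cloud-bigtable.
# Project, instance, table, and column names are placeholders.
def bulk_write(rows_data):
    from google.cloud import bigtable  # assumes google-cloud-bigtable is installed

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-data")

    batcher = table.mutations_batcher(flush_count=1000)
    rows = []
    for row_key, value in rows_data:
        row = table.direct_row(row_key)
        row.set_cell("cf1", "reading", value)
        rows.append(row)

    batcher.mutate_rows(rows)  # queues the mutations, flushing in batches
    batcher.flush()            # sends any remaining queued rows
```

Grouping many row mutations into a single MutateRows call is what closes the gap with the iterative per-row commits described in the question.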