0
votes

I have a simple proof-of-concept application written in traditional SQL. I need to scale it to a much larger size (potentially trillions of rows and multiple terabytes, possibly petabytes, of data). I'm trying to work out a pricing model for how this could be done using Google's Bigtable/BigQuery/Dataflow.

From what I gather from Google's pricing documents, Bigtable is priced by the number of nodes needed to handle the necessary QPS plus the storage required, whereas BigQuery is priced per query, based on the amount of data each query scans.

But what happens when inserts into a table require a lookup on that same table? Does that mean you have to factor an additional cost into each insert? If my total column size is 1 TB and I have to do a SELECT on that column before each additional insert, will I be charged $5 for each insert operation as a consequence? Do I have to adjust my logic to accommodate this pricing structure, e.g. by breaking the table into a set of smaller tables?

Any clarification would be much appreciated, as well as links to pricing examples for Bigtable/BigQuery/Dataflow that are more detailed and granular than what's available on Google's website.


3 Answers

0
votes

I am the product manager for Google Cloud Bigtable.

It's hard to give a detailed answer without a deeper understanding of the use case. For example, when you need to do a lookup before doing an insert, what's the complexity of the query? Is it an arbitrary SQL query, or can you get by with a lookup by primary key? How big is the data set?

If you only need to do lookups by key, then you may be able to use Bigtable (which, like HBase, only has a single key: the row key), and each lookup by row key is fast and does not require scanning the entire column.
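
For illustration, here is a minimal sketch of such a point lookup using the Python client library (google-cloud-bigtable); the project, instance, table, and row-key names are made up:

```python
from google.cloud import bigtable

# Hypothetical project/instance/table names -- adjust for your environment.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# A read by row key touches only that row; nothing else is scanned.
row = table.read_row(b"user#12345")
if row is not None:
    # Cells are keyed by column family, then column qualifier (bytes).
    value = row.cells["cf1"][b"balance"][0].value
    print(value)
```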

If you need complex lookups, you may be able to use:

  • Google BigQuery, but note that each lookup on a column is a full scan, as per this answer; though, as suggested in another answer, you can partition your data so that less data is scanned, if that helps

  • Google Cloud Datastore, which is a document database (like MongoDB); it allows you to set up indexes on some of the fields, so you can search by those properties (see the sketch after this list)

  • Google Cloud SQL, which is a managed MySQL service; it can scale to terabytes but not to petabytes, so it depends how big the dataset is that you need to query before inserting
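
As a rough sketch of the Datastore option (using the Python client, google-cloud-datastore, with a made-up kind and property name), a query against an indexed property reads only the matching entities:

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")  # hypothetical project id

# Datastore automatically indexes simple properties, so this filter is served
# from the index rather than by scanning every entity of the kind.
query = client.query(kind="Order")
query.add_filter("customer_id", "=", "12345")
recent_orders = list(query.fetch(limit=10))
```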

Finally, if your use case is going into the PB-range, I strongly encourage you to get in touch with Google Cloud Platform folks and speak with our architects and engineers to identify the right overall solution for your specific use cases, as there may be other optimizations that we can make if we can discuss your project in more detail.

0
votes

Regarding BigQuery, you can partition your data by day. If you only need to query the last few days, you are charged for scanning only those partitions, not the full table.
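
As a rough sketch (Python client, google-cloud-bigquery, with made-up project/dataset/table names), a filter on the partition pseudo-column limits both the data scanned and the amount billed:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Only the day partitions covered by the _PARTITIONTIME filter are scanned
# (and billed), not the whole `events` table.
sql = """
    SELECT COUNT(*) AS n
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME >= TIMESTAMP('2016-08-01')
      AND _PARTITIONTIME <  TIMESTAMP('2016-08-03')
"""
for row in client.query(sql).result():
    print(row.n)
```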

On the other hand, you may need to rethink your data management. Choosing an append-only, event-based data flow could help you avoid lookups on the same table.

0
votes

will I be charged $5 for each insert operation as a consequence?

Yes. Any time you scan that column, you will be charged for the full column's size, unless the result is cacheable (see query caching), which most likely is not your case.

Do I have to adjust my logic ... ?

Yes.
"Breaking the table into a set of smaller tables" (sharding with table wildcard functions) or partitioning is the way to go for you (see the sketch below).