1
votes

We need to store weather forecast data in a database so that it can be queried by latitude/longitude from an API.

The forecast data comes natively as a GRIB2 file, a georeferenced raster format with multiple bands. The GRIB2 file can be converted to CSV, which increases the file size by 20-40x. Stored as CSV, the data is about 35 GB and contains these columns:

0,rt,timestamp
1,vt,timestamp
2,param,string
3,level,string
4,lon,float
5,lat,float
6,value,float

This data will change every hour and needs to be re-ingested into the database. This means that the ingest cannot, under any circumstances, take longer than an hour (and would preferably be a lot shorter).

On top of this, 30 days' worth of previous forecast data must be stored in another database table (or tables), and also made accessible to be queried through an API. 30 days' worth of forecast data is 720 hours' worth of ingests, which is 720 * 35 GB = 25.2 TB. Each hour the oldest (721st) hour must be removed, and the data transferred from the latest-forecast table to the archive table(s).

I have looked into multiple managed Google database solutions (BigQuery, Cloud Spanner, Datastore, BigTable). So far it seems the BigTable pricing structure is most appropriate for the purpose of this API.

However, it seems that I cannot query the data by column value, let alone by two column values (latitude and longitude). Is this the case? Is there any way to structure the data to get around this limitation? If so, how would I query it?

If BigTable is the wrong tool for the job I would appreciate a recommendation for a more appropriate service.

1
Can you please share more details about the type of queries you want to perform on these data? Considering your case, BigTable seems like the right tool because of its amazing ingestion speed; however, you need to carefully design tables and indexes to make it really powerful. I also see the possibility of using BigQuery if you don't need to expose really old data but just want to preserve it for analysis; you can use its data expiration feature to make data disappear after a while too. – Pievis

1 Answer

1
votes

What you want to do is possible; however, you have to keep in mind the two main restrictions of Bigtable row-key design:

  1. Each table has only one index, the row key. There are no secondary indices.
  2. Rows are sorted and queried lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian, or network, byte order, the binary equivalent of alphabetical order.
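
As a side note on the second point: Bigtable compares raw bytes, so any numeric part of a row key should be written with a fixed width, otherwise the byte order will not match the numeric order. A quick Python illustration:

print(sorted(["9", "10", "100"]))     # ['10', '100', '9']    -> byte order, not numeric order
print(sorted(["009", "010", "100"]))  # ['009', '010', '100'] -> zero-padded keys sort numerically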

The right design depends on your queries; you will have to find a compromise between the row-key design and the queries you need to run.

In this case, you have to map (latitude, longitude) to a single row key. The main options (a rough sketch follows the list) are:

  • Concatenate latitude to longitude: row key: longitude#latitude.
    • Similar latitudes on the same longitude are stored together.
  • Concatenate longitude to latitude: row key: latitude#longitude.
    • Similar longitudes on the same latitude are stored together.
  • Places that are near each other are stored near each other: known as a space-filling curve (e.g. a geohash or Z-order curve): row key: integer.
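
A rough sketch of how the first two options could be encoded in Python (the offsets, the three-decimal precision and the padding width are assumptions to keep the keys fixed-width and non-negative, not anything mandated by Bigtable):

def encode_coord(value, offset, width=7):
    # Shift the coordinate so it is non-negative, keep 3 decimals,
    # and zero-pad to a fixed width so byte order matches numeric order.
    return str(int(round((value + offset) * 1000))).zfill(width)

def row_key_lon_lat(lon, lat):
    # Option 1: longitude#latitude - similar latitudes on the same longitude sort together.
    return encode_coord(lon, 180) + "#" + encode_coord(lat, 90)

def row_key_lat_lon(lon, lat):
    # Option 2: latitude#longitude - similar longitudes on the same latitude sort together.
    return encode_coord(lat, 90) + "#" + encode_coord(lon, 180)

print(row_key_lon_lat(-0.1278, 51.5074))  # 0179872#0141507

For option 3 you would instead use a space-filling curve such as a geohash or a Z-order (Morton) code of the two coordinates as the row key, so that points that are close on the map also end up close in the key space.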

Example of a table that keeps 30 days' worth of data in a single column family, stats (from Cloud Shell):

cbt createinstance my-instance "My instance" my-instance-c1 europe-west1-b 3 SSD
cbt createtable weather-ts "families=stats:maxage=30d||maxversions=31"
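
If you prefer the client libraries over cbt, a minimal sketch of the same table creation with the Python admin client could look like this (the instance and table names are taken from the cbt example above, the project ID is a placeholder; double-check the calls against the current google-cloud-bigtable docs):

from datetime import timedelta
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)  # placeholder project ID
table = client.instance("my-instance").table("weather-ts")

# Same GC policy as the cbt command: drop cells older than 30 days or beyond 31 versions.
gc_rule = column_family.GCRuleUnion(rules=[
    column_family.MaxAgeGCRule(timedelta(days=30)),
    column_family.MaxVersionsGCRule(31),
])
table.create(column_families={"stats": gc_rule})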

Set values (here the whole CSV as a string) for the key 123123:

cbt set weather-ts 123123 stats:value='FIRST_CSV'
cbt set weather-ts 123123 stats:value='SECOND_CSV'

See the values stored:

cbt read weather-ts

My output:

2020/01/06 10:29:37 -creds flag unset, will use gcloud credential
----------------------------------------
123123
  stats:value                              @ 2020/01/06-10:29:35.093000
    "SECOND_CSV"
  stats:value                              @ 2020/01/06-10:29:33.224000
    "FIRST_CSV"
----------------------------------------

Bigtable automatically compresses text, so the total storage usage might be less than you expect.
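
To close the loop on the original question of querying by latitude/longitude: here is a minimal sketch of writing one forecast value and reading it back with the Python data client, assuming the longitude#latitude key scheme from the sketch above (encode_coord and row_key_lon_lat are the hypothetical helpers defined there, not part of Bigtable):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # placeholder project ID
table = client.instance("my-instance").table("weather-ts")

# Write one cell, keyed by longitude#latitude.
key = row_key_lon_lat(-0.1278, 51.5074).encode("utf-8")
row = table.direct_row(key)
row.set_cell("stats", "value", "284.3".encode("utf-8"))  # forecast value serialized as a string
row.commit()

# Point lookup: exact (lon, lat) pair.
result = table.read_row(key)
print(result.cells["stats"][b"value"][0].value)

# Range scan: all latitudes for one longitude, i.e. every key with the prefix "<lon>#".
prefix = (encode_coord(-0.1278, 180) + "#").encode("utf-8")
end_key = prefix[:-1] + bytes([prefix[-1] + 1])
for r in table.read_rows(start_key=prefix, end_key=end_key):
    print(r.row_key, r.cells["stats"][b"value"][0].value)

Anything more general than a prefix scan (for example an arbitrary bounding box) has to be expressed as several such row ranges, which is where the space-filling-curve keys from the third option become useful.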