1
votes

I'm investigating the feasibility of using BigQuery to store sensor data in time series. The intent is to store the data in BQ and process it in Pandas... so far so good... Pandas can interpret a TIMESTAMP field index and create a Series.

An additional requirement is that the data support arbitrary tags as key/value pairs (e.g. job_id=1234, task_id=5678). BigQuery can support this nicely with REPEATED fields of type RECORD:

                   {'fields':
                       [
                           {
                               "mode": "NULLABLE",
                               "name": "timestamp",
                               "type": "TIMESTAMP"
                           },
                           {
                               "mode": "REPEATED",
                               "name": "tag",
                               "type": "RECORD",
                               "fields":
                               [
                                    {
                                        "name":"name",
                                        "type":"STRING"
                                    },
                                    {
                                        "name":"value",
                                        "type":"STRING"
                                    },
                                    {
                                        "mode": "NULLABLE",
                                        "name": "measurement_1",
                                        "type": "FLOAT"
                                    },
                                    {
                                        "mode": "NULLABLE",
                                        "name": "measurement_2",
                                        "type": "FLOAT"
                                    },
                                    {
                                        "mode": "NULLABLE",
                                        "name": "measurement_3",
                                        "type": "FLOAT"
                                    },
                                ]
                            },
                       ]
                   }

This works great for storing the data and it even works great for querying if I only need to filter on a single key/value combination

SELECT measurement_1 FROM measurements 
WHERE tag.name = 'job_id' AND tag.value = '1234'

However, I also need to be able to combine sets of tags in query expressions and I can't seem to make this work. For example this query returns no result

SELECT measurement_1 FROM measurements 
WHERE tag.name = 'job_id' AND tag.value = '1234'
      AND tag.name = 'task_id' AND tag.value = '5678'

Questions: Is it possible to formulate a query to do what I want using this schema? What is the recommended way to attach this type of variable data to an otherwise fixed schema in Big Query?

Thanks for any help or suggestions!

Note: If you're thinking this looks like a great fix for InfluxDB it's because that's what I've been using thus far. The seemingly insurmountable issue is the amount of series cardinality in my data set, so I'm looking for alternatives.

2
Why tag python, pandas, and time series? - Kartik
Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see meta.stackexchange.com/questions/5234/… for why it is important. Also important to vote on answer. Vote up answers that are helpful. There are more ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. - Mikhail Berlyant

2 Answers

1
votes

BigQuery Legacy SQL

SELECT measurement_1 FROM measurements 
OMIT RECORD IF
  SUM((tag.name = 'job_id' AND tag.value = '1234')
   OR (tag.name = 'task_id' AND tag.value = '5678')) < 2

BigQuery Standard SQL

SELECT measurement_1 FROM measurements 
WHERE (
  SELECT COUNT(1) FROM UNNEST(tag) 
  WHERE ((name = 'job_id' AND value = '1234')
      OR (name = 'task_id' AND value = '5678'))
) >= 2
0
votes

Repeated are great way for storing data series, collection etc.
In order to filter out from repeated fields just the value of one interest I would use the following template

SELECT 
    MAX( IF( filter criteria,  value_to_pull, null)) WITHIN RECORD AS some_name
FROM <table>

In your case it would be the following:

SELECT
  MAX(IF(tag.name = 'job_id' AND tag.value = '1234', measurement_1, NULL)) WITHIN RECORD AS job_1234_meassurement_1,
  MAX(IF(tag.name = 'task_id' AND tag.value = '5678', measurement_1, NULL)) WITHIN RECORD AS task_5678_meassurement_1,
  FROM measurements