0
votes

I have some data in Google BigTable over which I have built a BigQuery external table (as per Querying Cloud Bigtable data so that I can query the Bigtable table using conventional SQL (which I'm very familiar with).

When I issue a select * I get this: enter image description here

Now I would like to know the syntax for querying specific values in this nested data. For example, to get a list of accountIds I can do this:

SELECT  ARRAY(SELECT timestamp FROM UNNEST(attributes.column[OFFSET(0)].cell)) AS timestamp,
        ARRAY(SELECT SAFE_CONVERT_BYTES_TO_STRING(value) FROM UNNEST(attributes.column[OFFSET(0)].cell)) AS values
FROM `table` 
where SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'

which returns:
enter image description here

which is, well, kinda handy.

Similarly I can get car#le11mcr#policyStartDate by changing the OFFSET like so:

SELECT  ARRAY(SELECT timestamp FROM UNNEST(attributes.column[OFFSET(6)].cell)) AS timestamp,
        ARRAY(SELECT SAFE_CONVERT_BYTES_TO_STRING(value) FROM UNNEST(attributes.column[OFFSET(6)].cell)) AS values
FROM `table` 
where SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'

enter image description here

However both of these queries require me to know what value to pass to OFFSET() and that value appears to depend on the alphabetical order of the Bigtable columns hence if another column whose name starts with (say) 'b' appears in the future my queries would no longer return the same thing.

I need a better way of querying the table than using OFFSET(). Essentially I want to be able to say:

select the cell values and timestamp values for the cell whose name is accountId

or

select the cell values and timestamp values for the cell whose name is car#le11mcr#policyStartDate

Is there a way to do that? I'm not too familiar with BigQuery syntax for doing this.

1

1 Answers

0
votes

OK, I've made a tiny bit of progress.

This:

SELECT  
    (
    select array(select timestamp from unnest(cell)) 
    from unnest(attributes.column) where name in ('accountId')
    ) accountIdTimestamp,
    (
    select array(select value from unnest(cell)) 
    from unnest(attributes.column) where name in ('accountId')
    ) accountIdValue
FROM `table` 
where SAFE_CONVERT_BYTES_TO_STRING(rowkey) = 'XXXX'
limit 3

returns:
enter image description here

which is better, but notice it didn't return anything for the first two rows. That's because those two rows don't have a cell called accountId, a problem I can get around by introducing a WHERE clause:

SELECT  
    (
    select array(select timestamp from unnest(cell)) 
    from unnest(attributes.column) where name in ('accountId')
    ) accountIdTimestamp,
    (
    select array(select value from unnest(cell)) 
    from unnest(attributes.column) where name in ('accountId')
    ) accountIdValue
FROM `table` 
where ARRAY_LENGTH(ARRAY(
    select name from unnest(attributes.column) where name in ('accountId')
    )) > 0
limit 3

which returns:
enter image description here

That does what I want, I guess, but I'd like to think there's a better way of achieving this that doesn't require quite so much typing and so much complicated logic (the WHERE clause in particular feels like a very complicated way of saying only give me rows if there's an accountId).

Any advice to make this more efficient or readable would be appreciated.

My next challenge to solve is to return the accountIdValue for the max(accountIdTimestamp)