1
vote

This question is an extension of an earlier one: BigQuery check entire table for null values

I'm very new to BigQuery. I have a large table and would like to avoid ~150 ifnull calls. However, some of the unnested column names are the same, and they aren't handled properly by the answer to the question above. Here's an example of the data fields:

itemId
extension.new.name
extension.new.location
extension.old.name
extension.default
extension.category
...

The solution to the linked question is:

#standardSQL
SELECT col_name, COUNT(1) nulls_count
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(\w+)":null')) col_name
GROUP BY col_name 

and the result I get looks like this:

col_name    nulls_count
name        5
location    8
default     3
category    7
...

However, this solution treats extension.new.name and extension.old.name as the same column because of the GROUP BY. I'd like a null count for each of them individually:

col_name                nulls_count
extension.new.name      5
extension.new.location  8
extension.old.name      10
extension.default       3
extension.category      7

Is it possible to get the full nested column name along with the null count? I'm not sure whether there is a way to capture the entire name and concatenate its parts.


2 Answers

2
votes

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION xxx(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
  var result = [];
  processKey(JSON.parse(input), '');
  // recursively walk the parsed JSON and collect "full.path|value" strings
  function processKey(node, parent) {
    if (parent !== '') {parent += '.';}
    Object.keys(node).map(function(key) {
      var value;
      if (node[key] == null) {value = null;} else {value = node[key].toString();}
      if (value !== '[object Object]') {
        result.push(parent + key + '|' + value); // leaf value (or null) - record the full path
      } else {
        processKey(node[key], parent + key);     // nested record - recurse with the extended path
      }
    });
  }
  return result;
''';
SELECT COUNT(1) nulls_count, col_name
FROM (
  SELECT  
    SPLIT(col, '|')[OFFSET(0)] col_name,
    SPLIT(col, '|')[OFFSET(1)] col_value
  FROM `project.dataset.table` t, 
  UNNEST(xxx(TO_JSON_STRING(t))) col
)
WHERE col_value = 'null'
GROUP BY col_name     
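
If you want to sanity-check what the UDF emits before the aggregation, below is a quick sketch (it assumes the xxx() temp function above is defined in the same script; the inline row is just a made-up example, not your real table):

SELECT col
FROM (
  SELECT
    1 AS itemId,
    STRUCT(STRUCT('a' AS name, CAST(NULL AS STRING) AS location) AS new) AS extension
) t,
UNNEST(xxx(TO_JSON_STRING(t))) col
-- expected output: 'itemId|1', 'extension.new.name|a', 'extension.new.location|null'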

You might want to adjust it for your specific needs/expectations.
I tested it with a few public datasets and it looked more or less OK.

For example, for the bigquery-public-data.google_analytics_sample.ga_sessions_20170801 table, the output was:

Row nulls_count col_name     
1   2556    visitorId    
2   1246    totals.timeOnSite    
3   2513    totals.transactions  
4   2513    totals.transactionRevenue    
5   2556    totals.screenviews   
6   2556    totals.uniqueScreenviews     
7   2556    totals.timeOnScreen  
8   2513    totals.totalTransactionRevenue   
9   1907    trafficSource.referralPath   
10  2551    trafficSource.keyword    
11  2544    trafficSource.adContent  
12  2556    trafficSource.adwordsClickInfo.campaignId    
13  2556    trafficSource.adwordsClickInfo.adGroupId     
14  2556    trafficSource.adwordsClickInfo.creativeId    
15  2556    trafficSource.adwordsClickInfo.criteriaId    
16  2498    trafficSource.adwordsClickInfo.page  
17  2498    trafficSource.adwordsClickInfo.slot  
18  2498    trafficSource.adwordsClickInfo.gclId     
19  2556    trafficSource.adwordsClickInfo.customerId    
20  2498    trafficSource.adwordsClickInfo.adNetworkType     
21  2551    trafficSource.adwordsClickInfo.targetingCriteria     
22  2498    trafficSource.adwordsClickInfo.isVideoAd     
23  1684    trafficSource.isTrueDirect   
24  2556    trafficSource.campaignCode   
25  2556    device.javaEnabled   
26  1231    hits.0.isSecure  
27  1230    hits.0.page.searchKeyword    
28  1230    hits.0.page.searchCategory   
29  611     hits.0.transaction.transactionId     
30  611     hits.0.transaction.transactionRevenue    
31  611     hits.0.transaction.transactionTax    
32  611     hits.0.transaction.transactionShipping   
33  611     hits.0.transaction.affiliation   
34  611     hits.0.transaction.localTransactionRevenue   
35  611     hits.0.transaction.localTransactionTax   
36  611     hits.0.transaction.localTransactionShipping  
37  611     hits.0.transaction.transactionCoupon     
38  611     hits.0.item.transactionId    
39  611     hits.0.item.productName  
40  611     hits.0.item.productCategory  
41  611     hits.0.item.productSku   
42  611     hits.0.item.itemQuantity     
43  611     hits.0.item.itemRevenue  
44  611     hits.0.item.localItemRevenue     
45  1231    hits.0.contentInfo   
46  1231    hits.0.appInfo.name  
47  1231    hits.0.appInfo.version   
48  1231    hits.0.appInfo.id    
49  1231    hits.0.appInfo.installerId   
50  1231    hits.0.appInfo.appInstallerId    
51  1231    hits.0.appInfo.appName   
52  1231    hits.0.appInfo.appVersion    
53  1231    hits.0.appInfo.appId     
54  1231    hits.0.exceptionInfo.description     
55  1231    hits.0.exceptionInfo.exceptions  
56  1231    hits.0.exceptionInfo.fatalExceptions     
57  1231    hits.0.eventInfo     
58  1015    hits.0.promotionActionInfo   
59  1231    hits.0.refund    
60  1231    hits.0.eCommerceAction.option    
61  1231    hits.0.publisher     
62  1231    hits.0.social.socialInteractionNetwork   
63  1231    hits.0.social.socialInteractionAction    
64  1231    hits.0.social.socialInteractions     
65  1231    hits.0.social.socialInteractionTarget    
66  1231    hits.0.social.uniqueSocialInteractions   
67  1231    hits.0.latencyTracking   
68  1231    hits.0.sourcePropertyInfo    
69  1230    hits.0.contentGroup.contentGroupUniqueViews1     
70  1221    hits.0.contentGroup.contentGroupUniqueViews3     
71  1231    hits.0.contentGroup.contentGroupUniqueViews4     
72  1231    hits.0.contentGroup.contentGroupUniqueViews5     
73  2556    userId   
74  2556    clientId     
75  303     hits.0.referer   
76  684     totals.newVisits     
77  620     hits.0.contentGroup.contentGroupUniqueViews2     
78  620     hits.0.transaction   
79  620     hits.0.item  
80  22      hits.0.product.0.productRevenue  
81  22      hits.0.product.0.localProductRevenue     
82  22      hits.0.product.0.productQuantity     
83  22      hits.0.product.0.productRefundAmount     
84  22      hits.0.product.0.localProductRefundAmount    
85  22      hits.0.product.0.isClick     
86  22      hits.0.product.0.productCouponCode   
87  5       trafficSource.adwordsClickInfo.targetingCriteria.boomUserlistId  
88  216     hits.0.promotionActionInfo.promoIsClick  
89  1318    totals.bounces   

0
votes

Regarding your issue, it is happening because of:

  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(\w+)":null'))

The regex captures only the last part of each column name (the part after the last "."), so different nested columns that share a leaf name are lumped together under one col_name by the GROUP BY, and only a single combined count appears in the result.
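
As a quick illustration of that collapse (the JSON literal below is just a made-up fragment, not your table):

-- both extension.new.name and extension.old.name are reduced to the leaf name "name"
SELECT REGEXP_EXTRACT_ALL(
  '{"extension":{"new":{"name":null},"old":{"name":null}}}',
  r'"(\w+)":null') AS extracted
-- returns ["name", "name"]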

However, it is possible to change the approach and evaluate the strings after each ".". In your case, you would evaluate two patterns:

  • Names with two levels after the first ".", as in extension.new.name.
  • Names with just one level, as in extension.default.

I used the simple dummy data below to reproduce your case.

WITH data AS (
  SELECT 'extension.new.name' AS col_name UNION ALL
  SELECT 'extension.new.location' AS col_name UNION ALL
  SELECT 'extension.new.location' AS col_name UNION ALL
  SELECT 'extension.old.name' AS col_name UNION ALL
  SELECT 'extension.default' AS col_name UNION ALL
  SELECT 'extension.category' AS col_name
)
SELECT COALESCE(
         REGEXP_EXTRACT(col_name, r'(\.\w*\.\w*)'),
         REGEXP_EXTRACT(col_name, r'(\.\w*)')) AS col_name1
FROM data
GROUP BY col_name1

Notice that I deliberately repeated one of the column names just to check the accuracy. In addition, I used the built-in function COALESCE(), which returns the first non-null expression, together with REGEXP_EXTRACT(), which will:

  1. Search for names with two levels, as in extension.new.name.
  2. Otherwise, search for names with a single level, as in extension.default.

Lastly, the data is grouped by the new field col_name1, and the output is as follows:

(query output screenshot: one row per distinct extracted name, i.e. .new.name, .new.location, .old.name, .default and .category)

Therefore, in your script the COUNT(1) will be computed for each of the names above separately.
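
For completeness, here is a minimal sketch of that final grouping with COUNT(1) on the same dummy data (the counts are only illustrative; in your script the input would come from your table rather than this literal list):

WITH data AS (
  SELECT 'extension.new.name' AS col_name UNION ALL
  SELECT 'extension.new.location' AS col_name UNION ALL
  SELECT 'extension.new.location' AS col_name UNION ALL
  SELECT 'extension.old.name' AS col_name UNION ALL
  SELECT 'extension.default' AS col_name UNION ALL
  SELECT 'extension.category' AS col_name
)
SELECT
  COALESCE(
    REGEXP_EXTRACT(col_name, r'(\.\w*\.\w*)'),
    REGEXP_EXTRACT(col_name, r'(\.\w*)')) AS col_name1,
  COUNT(1) AS nulls_count
FROM data
GROUP BY col_name1
-- .new.location gets nulls_count = 2; every other name gets 1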