1
votes

How can I join two tables in a select statement in which I also use a UDF? I stored the SQL query and UDF function in two files that I call via the bq command line. However, when I run it, I get the following error:

BigQuery error in query operation: Error processing job '[projectID]:bqjob_[error_number]': Table name cannot be resolved: dataset name is missing.

Note that I'm logged in the correct project via the gcloud auth method. My SQL statement:

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
FROM
  defaultChannelGroup(
    SELECT
      a.date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId,
      a.visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign
    FROM FLATTEN(
      SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ,hits) as a
    LEFT JOIN FLATTEN(
      SELECT hits.page.pagePath,hits.time,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ,hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

where I replaced my datasetname with the correct name of course; and some of the UDF (which works with another query):

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      date: row.date
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  //['device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'date', 'type': 'string'}
],
  defaultChannelGroup
);
1
I haven't been able to reproduce. (Some BigQuery team members could also look at the log if you leave your full job id too) - Felipe Hoffa
Thanks @FelipeHoffa. I reran it this morning via the following command: bq query --udf_resource=Desktop/bq.js "$(cat Desktop/bq-sd-mkt-channels.sql)" and got the same error message, i.e.: bqjob_r324a276c6f5130bc_000001596949a59f_1': Table name cannot be resolved: dataset name is missing - kekchoze

1 Answers

1
votes

The select statements within the FLATTEN function needed to be in brackets.

Ran the bq command in the shell: bq query --udf_resource=udf.js "$(cat query.sql)"

query.sql contains the following scripts:

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
  COUNT(DISTINCT transactionId) as orders,
  CAST(SUM(transactionRevenue)/1000000 AS INTEGER) as sales
FROM
  defaultChannelGroup(
    SELECT
      a.date as date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId as fullVisitorId,
      a.visitId as visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign,
      a.hits.transaction.transactionRevenue as transactionRevenue,
      a.hits.transaction.transactionID as transactionId
    FROM FLATTEN((
      SELECT  date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID,
              hits.transaction.transactionID, hits.transaction.transactionRevenue
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ),hits) as a
    LEFT JOIN FLATTEN((
      SELECT hits.page.pagePath,hits.time,trafficSource.medium,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ),hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

and udf.js contains the following function (the 'computeDefaultChannelGroup' function is not included):

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      date: row.date,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      transactionId: row.transactionId,
      transactionRevenue: row.transactionRevenue,
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId', 'transactionRevenue'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'date', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'transactionId', 'type': 'string'},
  {'name': 'transactionRevenue', 'type': 'integer'}
],
  defaultChannelGroup
);

Ran without error and matched the data in Google Analytics.