
I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI. My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog). Following the GCP documentation, I was able to define my fileset, assign it a schema, and visualize it in the Resources tab of the BigQuery UI. But I cannot launch any Dataflow job in the Query editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...

Could it be an issue of some APIs not being authorized? Thanks for your help!
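If you want to rule that out, here is a quick check with the gcloud CLI (a sketch; the exact set of services Dataflow SQL needs is my assumption):

# List enabled services and filter for the ones Dataflow SQL likely relies on
gcloud services list --enabled | grep -E 'dataflow|datacatalog|bigquery'

# Enable any that are missing
gcloud services enable dataflow.googleapis.com datacatalog.googleapis.com bigquery.googleapis.com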

Take a look here in order to validate your fileset's full path in a particular query. – Nick_Kh
Thanks, but clicking on "Query fileset" in the Details panel makes BigQuery automatically fill the Query editor with a fully-qualified, dot-separated list of identifiers that follows the Standard SQL lexical structure. So this does not seem to be the issue. – Giorgio Rivero
I'm getting the same: I created the fileset, I can add it from the "Add Data" menu, see the schema, and click "Query Fileset", but it says the table is not found. I note that the documentation does say it only supports CSV files without headers (which needs a switch to ignore). I wonder if the data doesn't match the schema and format requirements? – Scott Hamilton
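One way to sanity-check the CSV requirements is to peek at the first lines of a file in the fileset (a sketch; the bucket and file names below are placeholders):

# Print the first few lines; the first line should be data, not a header row
gsutil cat gs://my-fileset-bucket/sales_regions.csv | head -n 3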

1 Answer


You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g., using the sales regions example from the docs:

# The --location flag determines the location segment of the Dataflow SQL path
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
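For completeness, the schema_file.json referenced above is a list of column definitions; a minimal sketch matching this example (the column names are assumptions based on the sales regions sample):

# Hypothetical schema file for the fileset; adjust the columns to your data
cat > schema_file.json <<'EOF'
[
  {"column": "state_code", "mode": "REQUIRED", "type": "STRING"},
  {"column": "sales_region", "mode": "REQUIRED", "type": "STRING"}
]
EOF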

When you are building your Dataflow SQL query:

SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions AS tr
INNER JOIN datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
  ON tr.state = sr.state_code

Check the full path; it should look like the example above:

datacatalog.entry, then your project-id, then your location (in this example, us-central1), then your entry group ID (in this example, dataflow_sql_dataset), and finally your entry ID (in this example, us_state_salesregions).
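If any component is in doubt, you can describe the entry and compare its fields against the path you are querying (a sketch using the example names from above; depending on your gcloud version this may require the beta component):

# Confirm the entry exists under the expected project, location, and entry group
gcloud data-catalog entries describe us_state_salesregions \
  --project=project-id \
  --location=us-central1 \
  --entry-group=dataflow_sql_dataset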

Let me know if this works for you.