
I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI. My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog). Following the GCP documentation, I was able to define my fileset, assign it a schema, and visualize it in the Resources tab of the BigQuery UI. But I cannot launch any Dataflow job from the Query editor, because the query validator returns the following error: Table not found: datacatalog.entry.location.entry_group.fileset_name...

Could it be an issue of some APIs not being authorized? Thanks for your help!
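For reference, this is how I understand the relevant APIs can be checked and enabled from the command line (a sketch, assuming the standard service names):

# List enabled APIs and look for the ones Dataflow SQL relies on
gcloud services list --enabled | grep -E 'dataflow|datacatalog|bigquery'

# Enable any that are missing (requires suitable IAM permissions)
gcloud services enable dataflow.googleapis.com datacatalog.googleapis.com bigquery.googleapis.com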

Have a look here to validate your fileset's full path in a particular query. - Nick_Kh
Thanks, but clicking on "Query fileset" in the Details panel makes BQ automatically fill the Query editor with a fully-qualified, dot-separated list of identifiers that follows the Standard SQL lexical structure. So that does not seem to be the issue. - Giorgio Rivero
I'm getting the same: I created the fileset, can add it from the 'Add Data' menu, and can see the schema, but clicking "Query Fileset" also says table not found. I note that the documentation does say it only supports CSV files without headers (headers require a switch to ignore). I wonder if the data doesn't match the schema and format requirements? - Scott Hamilton
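For anyone checking the schema theory above, a minimal schema file for a Data Catalog fileset looks roughly like this (a sketch; the column names are just the ones from the sales-regions example below, and the "columns" JSON structure is the form I recall from the docs):

{
  "columns": [
    {"column": "state_code", "mode": "REQUIRED", "type": "STRING"},
    {"column": "sales_region", "mode": "REQUIRED", "type": "STRING"}
  ]
}

As I read the docs, each headerless CSV row is then matched to these columns by position.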

1 Answer


You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:

gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
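If in doubt, you can inspect the entry you created to confirm which location it actually landed in (a sketch using the example names above):

gcloud data-catalog entries describe us_state_salesregions \
  --location=us-central1 \
  --entry-group=dataflow_sql_dataset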

When you are building your Dataflow SQL query:

SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions AS tr
  INNER JOIN
    datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
  ON tr.state = sr.state_code

Check the full path; it should look like the example above:

datacatalog.entry, then your project-id, then your location (us-central1 in this example), then your entry group id (dataflow_sql_dataset in this example), and finally your entry id (us_state_salesregions in this example).
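In other words, the general pattern is (placeholders in caps):

datacatalog.entry.`PROJECT_ID`.`LOCATION`.ENTRY_GROUP.ENTRY_ID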

Let me know if this works for you.