0
votes

I'm new to Airflow and I'm currently stuck on an issue with the BigQuery operator. I'm trying to execute a simple query on a table from a given dataset and copy the result into a new table in the same dataset. I'm using the BigQuery operator to do so, since according to the docs the 'destination_dataset_table' parameter is supposed to do exactly what I'm looking for (source: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/bigquery_operator/index.html#airflow.contrib.operators.bigquery_operator.BigQueryOperator).

But instead of copying the data, all I get is a new empty table with the schema of the one I'm querying from.

Here's my code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'end_date': datetime(2019, 1, 3),
    'retries': 10,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    dag_id='my_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql="SELECT some_columns, x, y, z FROM dataset_d.table_t WHERE some_columns = some_value",
    destination_dataset_table='dataset_d.table_u',
    bigquery_conn_id='something',
)

I don't get any warnings or errors; the code runs and the tasks are marked as success. It does create the table I wanted, with the columns I specified, but it is completely empty.

Any idea what I'm doing wrong?

EDIT: I tried the same code on a much smaller table (from 10 GB down to a few KB), with a query producing a much smaller result (from 500 MB down to a few KB), and this time it worked. Does the size of the table or of the query result matter? Is there a limit? Or does running a very large query introduce some kind of lag?

EDIT2: After a few more tests I can confirm that this issue is not related to the size of the query or of the table. It seems to have something to do with the date format. In my code the WHERE condition actually checks whether date_column = 'YYYY-MM-DD'. When I replace this condition with an int or string comparison, it works perfectly. Does BigQuery use a particular date format or require a particular syntax?

EDIT3: Finally getting somewhere: when I cast my date_column to DATE (CAST(date_column AS DATE)) to force its type, I get an error saying that my field is actually an int32 (argument type mismatch). But I'm SURE this field is a date, which implies that either BigQuery stores it as an int while displaying it as a date, or the BigQuery operator does some kind of hidden type conversion while loading tables. Any idea how to fix this?
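(For later readers hitting the same wall: in BigQuery standard SQL, a DATE column compares cleanly against a DATE literal. The sketch below reuses the table and column names from this question; the literal date is illustrative.)

```python
# Sketch: the query in standard SQL form, where date_column (a DATE field)
# is compared against a DATE literal rather than a bare string.
query = (
    "SELECT some_columns, x, y, z "
    "FROM `dataset_d.table_t` "              # standard SQL quotes tables with backticks
    "WHERE date_column = DATE '2019-01-01'"  # DATE literal, not just 'YYYY-MM-DD'
)
print(query)
```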

1
The example you are giving seems correct, and I would suspect that the query is returning nothing as a result. Have you tried copying the rendered SQL (in the UI) and running it in BigQuery? - judoole
Yes I did, but don't worry, I found the answer: it was the legacy SQL parameter. It seems to be set to True by default; I just turned it off and everything worked :) Lost so much time on such a stupid thing x) - Armaaj
You're not the first @armaaj ;) - judoole
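For anyone landing here later, the fix from the comments boils down to one flag. Below is a minimal sketch of the corrected task configuration, shown as a plain dict so it can be read and checked without a running Airflow install; write_disposition is an assumed addition (not in the original code) that overwrites the destination table on each run.

```python
# Sketch of the corrected BigQueryOperator configuration. The key change is
# use_legacy_sql=False: the contrib BigQueryOperator defaults it to True,
# which is what silently broke the DATE comparison in the question above.
copy_data_kwargs = {
    'task_id': 'copyData',
    'sql': (
        "SELECT some_columns, x, y, z "
        "FROM `dataset_d.table_t` "
        "WHERE date_column = DATE '2019-01-01'"
    ),
    'destination_dataset_table': 'dataset_d.table_u',
    'write_disposition': 'WRITE_TRUNCATE',  # assumed: replace the table each run
    'use_legacy_sql': False,                # the actual fix from the comments
    'bigquery_conn_id': 'something',
}
# In the DAG this would be used as: BigQueryOperator(dag=dag, **copy_data_kwargs)
```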

1 Answer

0
votes

I had a similar issue when transferring data from data sources other than BigQuery.

I suggest casting the date_column on the source side, for example: to_char(date_column, 'YYYY-MM-DD') as date

In general, I have found BigQuery's schema auto-detection to be problematic. The safest approach is to always specify the schema before executing the corresponding query, or to use operators that support schema definition.
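To make that last suggestion concrete, here is a hedged sketch of an explicit schema definition. The field names are illustrative, borrowed from the question; in Airflow's contrib operators a list like this can be passed as schema_fields (e.g. to BigQueryCreateEmptyTableOperator) so the destination table exists with the right types before the query task runs.

```python
# Illustrative explicit schema for the destination table; declaring
# date_column as DATE up front avoids relying on schema auto-detection.
schema_fields = [
    {'name': 'some_columns', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'x', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    {'name': 'y', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    {'name': 'z', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    {'name': 'date_column', 'type': 'DATE', 'mode': 'NULLABLE'},
]
```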