0 votes

We just started using Apache Airflow in our project for our data pipelines. While exploring its features, we came to know about configuring a remote folder as the log destination in Airflow. For that we:

Created a Google Cloud Storage bucket and, from the Airflow UI, created a new GS connection:

[Screenshot: Airflow GS connection form]

I am not able to understand all the fields. I just created a sample GS bucket under my project from the Google console and gave that project ID to this connection, leaving the key file path and scopes blank. Then I edited the airflow.cfg file as follows:

remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = test_gs
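
For reference, these two keys sit in the [core] section of airflow.cfg in Airflow 1.8, alongside base_log_folder (sketch below; the local path is just a placeholder):

[core]
base_log_folder = /path/to/local/logs
remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = test_gs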

After these changes I restarted the web server and scheduler, but my DAGs are still not writing logs to the GS bucket. I can see the logs being created in base_log_folder, but nothing is created in my bucket. Is there any extra configuration needed from my side to get this working?

Note: Using Airflow 1.8. (I faced the same issue with Amazon S3 as well.)

Updated on 20/09/2017

Tried the GS method; attaching a screenshot:

[Screenshot: updated GS connection settings]

I am still not getting logs in the bucket.

Thanks Anoop R


3 Answers

1 vote

I advise you to use a DAG to connect Airflow to GCP instead of the UI.

First, create a service account on GCP and download the JSON key.

Then execute this DAG (you can modify the scope of your access):

import json
from datetime import datetime

from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator


def add_gcp_connection(ds, **kwargs):
    """Add an Airflow connection for GCP."""
    new_conn = Connection(
        conn_id='gcp_connection_id',
        conn_type='google_cloud_platform',
    )
    scopes = [
        "https://www.googleapis.com/auth/pubsub",
        "https://www.googleapis.com/auth/datastore",
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/devstorage.read_write",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/cloud-platform",
    ]
    conn_extra = {
        "extra__google_cloud_platform__scope": ",".join(scopes),
        "extra__google_cloud_platform__project": "<name_of_your_project>",
        "extra__google_cloud_platform__key_path": "<path_to_your_json_key>",
    }
    new_conn.set_extra(json.dumps(conn_extra))

    # Only add the connection if it does not exist yet.
    session = settings.Session()
    if not session.query(Connection).filter(
            Connection.conn_id == new_conn.conn_id).first():
        session.add(new_conn)
        session.commit()
    else:
        msg = '\n\tA connection with `conn_id`={conn_id} already exists\n'
        print(msg.format(conn_id=new_conn.conn_id))


dag = DAG('add_gcp_connection',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@once')

# Task to add the connection
AddGCPCreds = PythonOperator(
    dag=dag,
    task_id='add_gcp_connection_python',
    python_callable=add_gcp_connection,
    provide_context=True)

Thanks to Yu Ishikawa for this code.
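
Once this DAG has run, the new connection shows up under Admin -> Connections in the UI. To use it for remote logging, point remote_log_conn_id in airflow.cfg at the conn_id created above (a sketch, assuming you keep conn_id='gcp_connection_id' and the bucket from the question):

remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = gcp_connection_id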

0 votes

Yes, you need to provide additional information for both the S3 and the GCP connection.

S3

Configuration is passed via the extra field as JSON. You can provide only a profile:

{"profile": "xxx"}

or credentials:

{"profile": "xxx", "aws_access_key_id": "xxx", "aws_secret_access_key": "xxx"}

or a path to a config file:

{"profile": "xxx", "s3_config_file": "xxx", "s3_config_format": "xxx"}

In the case of the first option, boto will try to detect your credentials.

Source code - airflow/hooks/S3_hook.py:107
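
If you prefer to create the S3 connection in code rather than through the UI, the same session-based pattern as in the GCP DAG above works; a minimal sketch (conn_id and credential values are placeholders, and any of the JSON variants above can go into the extra field):

import json

from airflow import settings
from airflow.models import Connection

# Create an S3 connection whose extra field carries the credentials JSON.
session = settings.Session()
new_conn = Connection(conn_id='test_s3', conn_type='s3')
new_conn.set_extra(json.dumps({
    "aws_access_key_id": "xxx",
    "aws_secret_access_key": "xxx",
}))
session.add(new_conn)
session.commit()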

GCP

You can either provide key_path and scope (see Service account credentials) or credentials will be extracted from your environment in this order:

  • Environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to a file with stored credentials information.
  • Stored "well known" file associated with gcloud command line tool.
  • Google App Engine (production and testing)
  • Google Compute Engine production environment.

Source code - airflow/contrib/hooks/gcp_api_base_hook.py:68
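
If you rely on this environment fallback rather than key_path, a quick sanity check is to ask Application Default Credentials what it resolves to; a minimal sketch, assuming the google-auth package is installed:

# Prints the project and credential type that Application Default Credentials
# picks up, following the same lookup order as the list above.
import google.auth

credentials, project = google.auth.default()
print(project, type(credentials).__name__)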

0 votes

The reason for logs not being written to your bucket could be related to the service account rather than the Airflow config itself. Make sure it has access to the mentioned bucket. I had the same problems in the past.

Try adding more generous permissions to the service account, e.g. even project-wide Editor, and then narrowing them down. You could also try using the GCS client with that key and see if you can write to the bucket.
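
For example, a minimal sketch with the google-cloud-storage Python client (bucket name and key path are placeholders):

# Try writing a test object to the bucket with the same service account key
# that the Airflow connection will use.
from google.cloud import storage

client = storage.Client.from_service_account_json('/path/to/key.json')
bucket = client.bucket('my_test_bucket')
blob = bucket.blob('airflow-write-test.txt')
blob.upload_from_string('test write from the service account')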

For me personally this scope works fine for writing logs: "https://www.googleapis.com/auth/cloud-platform"