18 votes

I am using Dockerized Apache Airflow version 1.9.0-2 (https://github.com/puckel/docker-airflow).

The scheduler produces a significant amount of logs, and the filesystem quickly runs out of space, so I am trying to programmatically delete the scheduler logs created by Airflow, found in the scheduler container under /usr/local/airflow/logs/scheduler.

I have all of these maintenance tasks set up: https://github.com/teamclairvoyant/airflow-maintenance-dags

However, these tasks only delete logs on the worker, and the scheduler logs are in the scheduler container.

I have also set up remote logging, sending logs to S3, but as mentioned in the SO post Removing Airflow task logs, this setup does not stop Airflow from writing to the local machine.

Additionally, I have tried creating a shared named volume between the worker and the scheduler, as outlined in Docker Compose - Share named volume between multiple containers. However, I get the following error in the worker:

ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler'

and the following error in the scheduler:

ValueError: Unable to configure handler 'file.processor': [Errno 13] Permission denied: '/usr/local/airflow/logs/scheduler/2018-04-11'

So, how do people delete scheduler logs?


4 Answers

10 votes

Inspired by this reply, I have added the airflow-log-cleanup.py DAG (with some changes to its parameters) from here to remove all old airflow logs, including scheduler logs.

My changes are minor, except that, given my EC2 instance's disk size (7.7G for /dev/xvda1) and my 4 DAGs, the 30-day default value for DEFAULT_MAX_LOG_AGE_IN_DAYS seemed too large, so I changed it to 14 days. Feel free to adjust it according to your environment:

    # before
    DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 30)
    # after
    DEFAULT_MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", 14)
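Since the DAG reads this value from an Airflow Variable with a default fallback, you can also override the retention without editing the file at all by defining the variable itself; assuming the Airflow 1.9/1.10 CLI, that would be roughly:

    airflow variables --set max_log_age_in_days 14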

4 votes

The following could be one option to resolve this issue.

Log in to the Docker container using the following command:

    docker exec -it <name-or-id-of-container> sh

Make sure the container is running before executing the above command.

Then use a cron job to schedule a recurring rm command on those log files, for example:
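A minimal sketch of such a crontab entry (the log path and the 14-day retention are assumptions taken from this thread; adjust them, and note that cron may not be running by default inside the puckel image):

    # Daily at 03:00: delete scheduler log files older than 14 days
    0 3 * * * find /usr/local/airflow/logs/scheduler -type f -mtime +14 -delete
    # Daily at 03:30: prune the dated directories left empty by the step above
    30 3 * * * find /usr/local/airflow/logs/scheduler -mindepth 1 -type d -empty -delete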

4 votes

This answer to "Removing Airflow Task logs" also fits your use case in Airflow 1.10.

Basically, you need to implement a custom log handler and configure Airflow logging to use that handler instead of the default (see UPDATING.md in the Airflow source repo, not the README or the docs!).

One word of caution: due to the way logging, multiprocessing, and Airflow's default handlers interact, it is safer to override handler methods than to extend them by calling super() in a derived handler class, because Airflow's default handlers don't use locks.
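To illustrate that advice only (this is not the handler from the linked answer; the class name is made up), a derived handler can override emit() outright instead of extending it via super():

    import logging

    class DiscardingProcessorHandler(logging.Handler):
        """Hypothetical handler that silently drops file-processor records."""

        def emit(self, record):
            # Override rather than call super(): discard the record
            # instead of writing it to disk.
            pass

You would then point the file.processor handler in your custom logging config (the one referenced by logging_config_class in airflow.cfg) at a class like this, or at a variant that rotates/prunes files instead of discarding everything.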

-1 votes

I spent a lot of time trying to add "maintenance" DAGs that would clear logs generated by the different airflow components started as Docker containers.

The problem was in fact more at the Docker level: each of those processes is responsible for tons of logs that are, by default, stored by Docker in JSON files. The solution was to change the logging driver so that logs are no longer stored on the Docker host instance but instead sent directly to AWS CloudWatch Logs in my case.

I just had to add the following to each service in the docker-compose.yml file (https://github.com/puckel/docker-airflow):

    logging:
      driver: awslogs
      options:
        awslogs-group: myAWSLogsGroupID

Note that the EC2 instance on which my "docker-composed" Airflow app is running has an AWS role that allows it to create a log stream and add log events (the CreateLogStream and PutLogEvents actions in the AWS IAM service).

If you run it on a machine outside of the AWS ecosystem, you'd need to ensure it has access to AWS through credentials.
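For reference, a minimal IAM policy statement granting just those two actions could look like the sketch below (the log-group ARN is a placeholder built from the group name above; adapt the region and account ID to your setup):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
          "Resource": "arn:aws:logs:*:*:log-group:myAWSLogsGroupID:*"
        }
      ]
    }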