3
votes

Objective

I want to deploy Airflow on Kubernetes so that all pods have access to the same DAGs on a shared Persistent Volume. According to the documentation (https://github.com/helm/charts/tree/master/stable/airflow#using-one-volume-for-both-logs-and-dags), it seems I have to set and pass these values to Helm: airflow.extraVolumes, airflow.extraVolumeMounts, persistence.enabled, logsPersistence.enabled, dags.path, logs.path.

Problem

Any custom values I pass when installing the official Helm chart result in errors similar to:

Error: YAML parse error on airflow/templates/deployments-web.yaml: error converting YAML to JSON: yaml: line 69: could not find expected ':'
  • Works fine: microk8s.helm install --namespace "airflow" --name "airflow" stable/airflow
  • Not working:
microk8s.helm install --namespace "airflow" --name "airflow" stable/airflow \
--set airflow.extraVolumes=/home/*user*/github/airflowDAGs \
--set airflow.extraVolumeMounts=/home/*user*/github/airflowDAGs \
--set dags.path=/home/*user*/github/airflowDAGs/dags \
--set logs.path=/home/*user*/github/airflowDAGs/logs \
--set persistence.enabled=false \
--set logsPersistence.enabled=false
  • Also not working: microk8s.helm install --namespace "airflow" --name "airflow" stable/airflow --values=values_pv.yaml, with values_pv.yaml: https://pastebin.com/PryCgKnC
    • Edit: Please change /home/*user*/github/airflowDAGs to a path on your machine to replicate the error.

Concerns

  1. Maybe it is going wrong because of these lines in the default values.yaml:
## Configure DAGs deployment and update
dags:
  ##
  ## mount path for persistent volume.
  ## Note that this location is referred to in airflow.cfg, so if you change it, you must update airflow.cfg accordingly.
  path: /home/*user*/github/airflowDAGs/dags

How do I configure airflow.cfg in a Kubernetes deployment? In a non-containerized deployment of Airflow, this file can be found in ~/airflow/airflow.cfg.

  2. Line 69 in the error message refers to: https://github.com/helm/charts/blob/master/stable/airflow/templates/deployments-web.yaml#L69

That line contains git. Is the .yaml misconfigured, so that it wrongly tries to run git pull and then fails because no git path is specified?

System

  • OS: Ubuntu 18.04 (single machine)
  • MicroK8s: v1.15.4 Rev:876
  • microk8s.kubectl version: v1.15.4
  • microk8s.helm version: v2.14.3

Question

How do I correctly pass the right values to the Airflow Helm chart to be able to deploy Airflow on Kubernetes with Pods having access to the same DAGs and logs on a Shared Persistent Volume?

2
Maybe instead of trying to --set extraVolumes and extraVolumeMounts, change them in values.yaml? Have you tried doing it that way? github.com/helm/charts/blob/master/stable/airflow/… - Jakub
@jt97 Yes, that was my other attempt. It's the bullet point with values_pv.yaml in the question (pastebin.com/PryCgKnC). I assume that renaming values.yaml has no impact on the functionality. - NumesSanguis
@NumesSanguis Were you able to set up the volume mount? - alltej
@alltej I didn't have time for my project anymore and it actually didn't require Kubernetes, so I went with a simpler solution. Also, Helm 3.0 has been released, which means the answer would likely change. I hope I have a chance to try again in the future. - NumesSanguis

2 Answers

3
votes

Not sure if you have this solved yet, but if you haven't I think there is a pretty simple way close to what you are doing.

All of the Deployments, Services, and Pods need the persistent volume information: where the directory lives locally and where it should be mounted within each Kubernetes resource. It looks like the values.yaml for the chart provides a way to do this. I'll only show this with dags below, but I think it should be roughly the same process for logs as well.

So the basic steps are, 1) tell kube where the 'volume' (directory) lives on your computer, 2) tell kube where to put that in your containers, and 3) tell airflow where to look for the dags. So, you can copy the values.yaml file from the helm repo and alter it with the following.

  1. The airflow section

First, you need to create a volume from your local directory (this is the extraVolumes entry below). Then that volume needs to be mounted; luckily, putting the mount here templates it into all of the kube files. So basically, extraVolumes creates the volume, and extraVolumeMounts mounts it at the dags path.

airflow:
  extraVolumeMounts: # this will get the volume and mount it to that path in the container
  - name: dags
    mountPath: /usr/local/airflow/dags  # location in the container it will put the directory mentioned below.

  extraVolumes: # this will create the volume from the directory
  - name: dags
    hostPath:
      path: "path/to/local/directory"  # For you this is something like /home/*user*/github/airflowDAGs/dags

  2. Tell the airflow config where the dags live in the container (same yaml section as above).
airflow:
  config:
    AIRFLOW__CORE__DAGS_FOLDER: "/usr/local/airflow/dags"  # this needs to match the mountPath in the extraVolumeMounts section
  3. Install with helm and your new values.yaml file.
helm install --namespace "airflow" --name "airflow" -f local/path/to/values.yaml stable/airflow
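
Before installing for real, it may help to render the chart with your values first, since a malformed value should surface the same "YAML parse error" without deploying anything. This is only a sketch: the function name render_check and the HELM variable are hypothetical helpers (HELM is there so you can swap in microk8s.helm), and the flags follow the Helm 2 syntax used above.

```shell
# Binary to call; set HELM=microk8s.helm on MicroK8s (assumption: Helm 2 CLI).
HELM="${HELM:-helm}"

# render_check is a hypothetical helper name.
# $1: path to your values file
render_check() {
  # --dry-run simulates the install; --debug prints the rendered manifests.
  "$HELM" install --dry-run --debug \
    --namespace "airflow" --name "airflow" \
    -f "$1" stable/airflow
}
```

Usage would be something like `render_check local/path/to/values.yaml`; if that prints manifests instead of an error, the values file is at least syntactically acceptable to the templates.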

In the end, this should allow airflow to see your local directory in the dags folder. If you add a new file, it should show up in the container - though it may take a minute to show up in the UI - I don't think the dagbag process is constantly running? Anyway, hope this helps!
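
Since the steps above only cover dags, here is a rough sketch of what the same approach might look like with logs added. The file location, the host directory, and the container log path /usr/local/airflow/logs are assumptions for illustration; the mount path would need to match whatever logs.path is set to in your values.yaml.

```shell
# Assumed locations; change both to match your machine.
HOST_DIR="/home/user/github/airflowDAGs"
VALUES_FILE="${TMPDIR:-/tmp}/values_pv.yaml"

# Write a values file that mounts both host directories into the pods.
cat > "$VALUES_FILE" <<EOF
airflow:
  extraVolumeMounts:
  - name: dags
    mountPath: /usr/local/airflow/dags
  - name: logs
    mountPath: /usr/local/airflow/logs  # assumed log folder in the container
  extraVolumes:
  - name: dags
    hostPath:
      path: "${HOST_DIR}/dags"
  - name: logs
    hostPath:
      path: "${HOST_DIR}/logs"
EOF
```

The generated file can then be passed to helm install with -f, in the same way as in step 3.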

0
votes

Do it with a YAML file

If we look at the values.yaml approach, the problem is that you edited the file the wrong way:

extraVolumeMounts: home/*user*/github/airflowDAGs
  ## Additional volumeMounts to the main containers in the Scheduler, Worker and Web pods.
  # - name: synchronised-dags
  #   mountPath: /usr/local/airflow/dags
  extraVolumes: home/*user*/github/airflowDAGs
  ## Additional volumes for the Scheduler, Worker and Web pods.
  # - name: synchronised-dags
  #   emptyDir: {}

You can't just pass a path like that, because extraVolumeMounts needs a name and a mountPath to work. That's why the commented-out lines (starting with #) are there: delete the # characters, add your own values, and it should work.

It should look like this

 extraVolumeMounts:
 - name: synchronised-dags
   mountPath: /usr/local/airflow/dags
 extraVolumes:
 - name: synchronised-dags
   emptyDir: {}
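
If you would rather stay with --set, as in the question, the likely reason the original command failed is that a bare path makes these values plain strings where the templates expect lists. Helm's --set syntax can address list entries by index, so a sketch of the equivalent might look like the following. The function name install_with_set, the HELM variable, and the use of hostPath with this directory (instead of emptyDir) are assumptions for illustration.

```shell
# Binary to call; set HELM=microk8s.helm on MicroK8s (assumption: Helm 2 CLI).
HELM="${HELM:-helm}"
# Assumed host directory holding the DAG files.
DAGS_DIR="/home/user/github/airflowDAGs/dags"

# install_with_set is a hypothetical helper name.
install_with_set() {
  # Quote each --set so the shell does not treat [0] as a glob pattern.
  "$HELM" install --namespace "airflow" --name "airflow" stable/airflow \
    --set "airflow.extraVolumes[0].name=synchronised-dags" \
    --set "airflow.extraVolumes[0].hostPath.path=${DAGS_DIR}" \
    --set "airflow.extraVolumeMounts[0].name=synchronised-dags" \
    --set "airflow.extraVolumeMounts[0].mountPath=/usr/local/airflow/dags"
}
```

For anything beyond one or two entries, the values file is usually easier to read and maintain than indexed --set flags.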

Here is how you can install it:

1. Use helm fetch to download the airflow chart to your machine:

helm fetch stable/airflow --untar

2. Edit extraVolumeMounts and extraVolumes in airflow/values.yaml as in the example above, adding your own name and path:

nano/vi/vim airflow/values.yaml

3. Either change the remaining settings in airflow/values.yaml as well and use:

helm install ./airflow --namespace "airflow" --name "airflow" -f ./airflow/values.yaml

OR

use this command, with only extraVolumeMounts and extraVolumes edited in the file and the remaining values passed via --set:

helm install \
  --set dags.path=/home/user/github/airflowDAGs/dags \
  --set logs.path=/home/user/github/airflowDAGs/logs \
  --set persistence.enabled=false \
  --set logsPersistence.enabled=false \
  ./airflow --namespace "airflow" --name "airflow" -f ./airflow/values.yaml

Result:

NAME:   airflow
LAST DEPLOYED: Fri Oct 11 09:18:46 2019
NAMESPACE: airflow
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                  DATA  AGE
airflow-env           20    2s
airflow-git-clone     1     2s
airflow-postgresql    0     2s
airflow-redis         3     2s
airflow-redis-health  3     2s
airflow-scripts       1     2s

==> v1/Deployment
NAME               READY  UP-TO-DATE  AVAILABLE  AGE
airflow-flower     0/1    1           0          1s
airflow-scheduler  0/1    1           0          1s
airflow-web        0/1    1           0          1s

==> v1/PersistentVolumeClaim
NAME                STATUS   VOLUME    CAPACITY  ACCESS MODES  STORAGECLASS  AGE
airflow-postgresql  Pending  standard  2s

==> v1/Pod(related)
NAME                                 READY  STATUS             RESTARTS  AGE
airflow-flower-5596b45d58-wrg74      0/1    ContainerCreating  0         1s
airflow-postgresql-75bf7d8774-dxxjn  0/1    Pending            0         1s
airflow-redis-master-0               0/1    ContainerCreating  0         1s
airflow-scheduler-8696d66bcf-bwm2s   0/1    ContainerCreating  0         1s
airflow-web-84797489f5-8wzsm         0/1    ContainerCreating  0         1s
airflow-worker-0                     0/1    Pending            0         0s

==> v1/Secret
NAME                TYPE    DATA  AGE
airflow-postgresql  Opaque  1     2s
airflow-redis       Opaque  1     2s

==> v1/Service
NAME                    TYPE       CLUSTER-IP   EXTERNAL-IP  PORT(S)   AGE
airflow-flower          ClusterIP  10.0.7.168   <none>       5555/TCP  1s
airflow-postgresql      ClusterIP  10.0.8.62    <none>       5432/TCP  2s
airflow-redis-headless  ClusterIP  None         <none>       6379/TCP  1s
airflow-redis-master    ClusterIP  10.0.8.5     <none>       6379/TCP  1s
airflow-web             ClusterIP  10.0.10.176  <none>       8080/TCP  1s
airflow-worker          ClusterIP  None         <none>       8793/TCP  1s

==> v1/ServiceAccount
NAME     SECRETS  AGE
airflow  1        2s

==> v1/StatefulSet
NAME            READY  AGE
airflow-worker  0/1    1s

==> v1beta1/Deployment
NAME                READY  UP-TO-DATE  AVAILABLE  AGE
airflow-postgresql  0/1    1           0          1s

==> v1beta1/PodDisruptionBudget
NAME         MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
airflow-pdb  N/A            1                0                    2s

==> v1beta1/Role
NAME     AGE
airflow  2s

==> v1beta1/RoleBinding
NAME     AGE
airflow  2s

==> v1beta2/StatefulSet
NAME                  READY  AGE
airflow-redis-master  0/1    1s


NOTES:
Congratulations. You have just deployed Apache Airflow
   export POD_NAME=$(kubectl get pods --namespace airflow -l "component=web,app=airflow" -o jsonpath="{.items[0].metadata.name}")
   echo http://127.0.0.1:8080
   kubectl port-forward --namespace airflow $POD_NAME 8080:8080

2. Open Airflow in your web browser
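
Once the pods are running, one way to check that the volume is actually mounted is to list the DAG folder inside the web pod. This is a sketch under assumptions: the function name list_dags_in_web_pod and the KUBECTL variable are hypothetical (on MicroK8s you would set KUBECTL=microk8s.kubectl), the label selector is the one from the NOTES output above, and /usr/local/airflow/dags is the mountPath from the example values.

```shell
# Binary to call; e.g. microk8s.kubectl on MicroK8s.
KUBECTL="${KUBECTL:-kubectl}"

# list_dags_in_web_pod is a hypothetical helper name.
list_dags_in_web_pod() {
  # Look up the web pod using the same selector as the chart's NOTES.
  pod="$("$KUBECTL" get pods --namespace airflow \
    -l "component=web,app=airflow" \
    -o jsonpath="{.items[0].metadata.name}")"
  # List the mounted DAG folder inside that pod.
  "$KUBECTL" exec --namespace airflow "$pod" -- ls -l /usr/local/airflow/dags
}
```

If the listing shows your DAG files, the mount is in place; an empty directory would suggest the volume definition or the mount path needs another look.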