0 votes

I have a task in an Airflow DAG that requires 100 GB of RAM to complete successfully. My Composer environment has 3 nodes with 50 GB of memory each, and 3 workers (one running on each node). The issue is that this task runs on only one of the workers (so the maximum memory it can use is 50 GB), and it therefore fails with memory errors.

Is there a way to make this task use memory from all the nodes (150 GB)? (Assume we can't split the task into smaller steps.)

Also, in Cloud Composer, can we make a worker span multiple nodes? (If so, I could force one worker to run on all three nodes and use 150 GB of memory.)

Re "make a worker span across multiple nodes": you are talking about distributed computing, which is not as simple as you make it sound; this is why frameworks like Hadoop and Spark were created. Perhaps you can rewrite your job in Dataflow instead, since Dataflow will scale your task across multiple workers, and use the Dataflow operator. – cryanbhu

Like the others mentioned, if you want to brute-force it, you can use Airflow to spin up a single GCE instance with enough RAM, run the task, and then spin down the instance. – cryanbhu
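A minimal sketch of the Dataflow route from the first comment, assuming Airflow 1.10 (the contrib operators Composer shipped at the time); the project, bucket, and pipeline file are hypothetical placeholders:

    # Offload the heavy task to a Dataflow pipeline, which scales across many
    # Dataflow workers instead of being pinned to a single Airflow worker's memory.
    from airflow import DAG
    from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
    from airflow.utils.dates import days_ago

    with DAG("heavy_task_on_dataflow", schedule_interval=None,
             start_date=days_ago(1)) as dag:
        run_pipeline = DataFlowPythonOperator(
            task_id="run_heavy_pipeline",
            py_file="gs://my-bucket/pipelines/heavy_job.py",  # hypothetical Beam pipeline
            options={"temp_location": "gs://my-bucket/tmp"},
            dataflow_default_options={"project": "my-project",
                                      "region": "us-central1"},
        )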

2 Answers

1 vote

If a single DAG is resource-intensive enough to exhaust an entire Composer node's resources, then adding more nodes will not help unless co-scheduled workers are the problem. The possible solution is therefore to create a new Cloud Composer environment with a larger machine type than the current one; please refer to the public documentation.

High memory pressure in any of the GKE nodes will lead Kubernetes to evict pods from that node in an attempt to relieve the pressure. While many different Airflow components run within GKE, most don't tend to use much memory, so the most frequent case is that a user uploaded a resource-intensive DAG: the Airflow workers run it, run out of resources, and then get evicted.

You can check this with the following steps (a programmatic equivalent is sketched after the list):

  1. In the Cloud Console, navigate to Kubernetes Engine -> Workloads
  2. Click on airflow-worker, and look under Managed pods
  3. If any pods show Evicted, click each evicted pod and look for the "The node was low on resource: memory" message at the top of the window.
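If you prefer to check programmatically, here is a minimal sketch using the official kubernetes Python client; the namespace is an assumption, since Composer runs its airflow-worker pods in an environment-specific GKE namespace:

    # List pods and print any that were evicted, along with the eviction
    # message (e.g. "The node was low on resource: memory").
    from kubernetes import client, config

    config.load_kube_config()  # assumes kubectl already points at the Composer cluster
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod("default").items:  # namespace is an assumption
        if pod.status.reason == "Evicted":
            print(pod.metadata.name, "-", pod.status.message)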
1 vote

If the task needs less than 128 GB, you could start a Compute Engine instance with enough memory and run the task there, using the various operators to that end (a sketch follows): https://airflow.apache.org/docs/stable/howto/operator/gcp/compute.html
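A minimal sketch of that approach, assuming Airflow 1.10's contrib operators (the ones in the linked page); the project, zone, instance name, and SSH command are hypothetical placeholders:

    # Start a pre-created high-memory GCE instance, run the job on it,
    # then stop the instance to avoid paying for idle time.
    from airflow import DAG
    from airflow.contrib.operators.gcp_compute_operator import (
        GceInstanceStartOperator,
        GceInstanceStopOperator,
    )
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago

    with DAG("heavy_task_on_gce", schedule_interval=None,
             start_date=days_ago(1)) as dag:
        start = GceInstanceStartOperator(
            task_id="start_instance",
            project_id="my-project",       # hypothetical project
            zone="us-central1-a",
            resource_id="highmem-worker",  # pre-created high-memory instance
        )
        run_job = BashOperator(
            task_id="run_job",  # placeholder for the actual work on the instance
            bash_command=("gcloud compute ssh highmem-worker "
                          "--zone us-central1-a --command './run_heavy_job.sh'"),
        )
        stop = GceInstanceStopOperator(
            task_id="stop_instance",
            project_id="my-project",
            zone="us-central1-a",
            resource_id="highmem-worker",
        )
        start >> run_job >> stop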