0
votes

I would like to run Airflow DAGs on Databricks.

I have installed apache-airflow 1.9.0 (python3 package) on Databricks.

In a Databricks notebook, I ran:

  %sh
  airflow list_dags

I got:

 -------------------------------------------------------------------
 DAGS
 -------------------------------------------------------------------
 example_bash_operator
 example_branch_dop_operator_v3
 example_trigger_target_dag
 example_xcom
 latest_only
 latest_only_with_trigger
 test_utils
 tutorial

I would like to visualize the above DAGs in the graph view.

I can do this by installing the Airflow Docker image on my local machine and then visiting localhost:8080.

But I cannot find out how to do this on Databricks.

Thanks

UPDATE: I ran

  %sh
  airflow webserver -p 8080

I have tried to access localhost:8080 by running

  %sh
  curl localhost:8080

in a Databricks notebook.

I got:

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  100   221  100   221    0     0  23440      0 --:--:-- --:--:-- --:--:-- 24555
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
  <title>Redirecting...</title>
  <h1>Redirecting...</h1>
  <p>You should be redirected automatically to target URL: <a href="/admin/">/admin/</a>.  If not click the link.

But this only appears in the notebook output, and there are no clickable links in it. Airflow is installed on the Databricks cluster, not on my local machine, so if I open localhost:8080 on my local machine, I cannot access it.

thanks

2 Answers

1
votes
  1. Regarding your update about using curl to retrieve the page served by the Airflow webserver, you can use the command below (basically, you need to request the URL under the admin directory):

    %sh curl localhost:8080/admin/

However, you won't be able to open any Airflow UI page as you would from your local machine (curl only shows the output as plain text in the Databricks console).

  2. Databricks clusters are mostly temporary resources: they are treated as a "processing tool" that can be started, stopped, restarted, and terminated at any time without affecting the stored data, so they are not expected to stay up just to host a webserver process. Databricks also provides an "abstraction" over the clusters, so you are not expected to access the cluster's nodes directly (i.e. the specific IP of each node). You could enable SSH on the nodes (as explained here: https://docs.databricks.com/clusters/configure.html#ssh-access-to-clusters) and then open a node to the internet in order to reach the UI by URL (in which case you would use the server IP). However, unless the Databricks platform itself provides a UI for a specific service (e.g. MLflow, Delta), opening those IPs to the internet is not recommended because of the security risk.

  3. In the Databricks integration with Airflow, the main idea is that you have an external Airflow master node (where the webserver process runs), from which you connect to the Databricks cluster to execute the jobs (via the DatabricksSubmitRunOperator, which internally calls the Databricks REST API), as described in this link previously posted by @CHEEKATLAPRADEEP-MSFT: https://docs.databricks.com/dev-tools/data-pipelines.html
    So far, it is not expected that you run and keep the Airflow webserver process alive on Databricks clusters (it would consume resources), so being able to access the webserver inside Databricks should not matter. What matters is that the job runs on Databricks through the Airflow operators.
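
To make point 3 a bit more concrete, here is a rough sketch of the kind of request an external Airflow node ends up sending to Databricks when it submits a run. It uses only the Python standard library and does not need Airflow installed. The helper name `build_submit_run_request`, the workspace URL, the token, the notebook path, and the cluster values are all placeholders I made up for illustration. The request is only built and printed, never sent.

```python
import json
import urllib.request


def build_submit_run_request(host, token, notebook_path, run_name="airflow-triggered-run"):
    """Build (but do not send) a POST request for the Databricks runs/submit endpoint."""
    payload = {
        "run_name": run_name,
        # Placeholder cluster spec: replace with values valid for your workspace.
        "new_cluster": {
            "spark_version": "<spark-version>",
            "node_type_id": "<node-type>",
            "num_workers": 1,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token, and notebook path, for illustration only.
request = build_submit_run_request(
    host="https://example.cloud.databricks.com",
    token="dapi-EXAMPLE",
    notebook_path="/Users/someone@example.com/my_notebook",
)

# Inspect what would be sent; urllib.request.urlopen(request) would actually send it.
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In a real setup you would normally not write this by hand: the DatabricksSubmitRunOperator takes a similar JSON payload and handles authentication and polling through the Airflow connection, and the DAG containing it shows up in the web UI of the external Airflow node.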

Hope this helps to answer your question

0
votes

You can visualize the DAG in the Airflow web UI. Run airflow webserver and connect to localhost:8080. Click on any of the example_databricks_operator DAGs to see the available visualizations of your DAG.

Reference: Integrating Apache Airflow with Databricks.

Hope this helps. Do let us know if you have any further queries.


Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members.