1
votes

I am trying to run a Jupyter notebook from within another notebook in Databricks.

The code below fails with the error 'df3 is not defined'. But df3 is defined.

import pandas as pd

input_file = pd.read_csv("/dbfs/mnt/container_name/input_files/xxxxxx.csv")
df3 = input_file
%run ./NotebookB

The first line of code in NotebookB is below (all Markdown cells render in Databricks with no issues):

df3.iloc[:,1:] = df3.iloc[:,1:].clip(lower=0)
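For reference, that line clips every column except the first at zero. A minimal standalone pandas sketch of what it does (the column names and data here are made up):

```python
import pandas as pd

# Hypothetical data: first column is an ID, the rest are numeric values.
df3 = pd.DataFrame({"id": [1, 2, 3], "a": [-5, 0, 7], "b": [2, -1, -3]})

# Clip all columns except the first at zero (negatives become 0).
df3.iloc[:, 1:] = df3.iloc[:, 1:].clip(lower=0)

print(df3)
#    id  a  b
# 0   1  0  2
# 1   2  0  0
# 2   3  7  0
```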

I do not get this error in Jupyter Notebook itself; for example, the code below works:

import pandas as pd

input_file = pd.read_csv("xxxxxx.csv")
df3 = input_file
%run "NotebookB.ipynb"

Basically, it seems that when NotebookB runs in Databricks, the definition of df3 is not carried over or is forgotten, leading to the 'not defined' error.

Both Jupyter notebooks are in the same Workspace folder in Databricks.


2 Answers

0
votes

I see you want to pass structured data, like a DataFrame, from one Azure Databricks notebook to another by calling it.

Please refer to the official document Notebook Workflows to learn how to use the functions dbutils.notebook.run and dbutils.notebook.exit to do that.

Here is the sample Python code from the section Pass structured data of the official document above.

%python

# Example 1 - returning data through temporary tables.
# You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can
# return a name referencing data stored in a temporary table.

## In callee notebook
sqlContext.range(5).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))

So to pass the pandas DataFrame in your code, you first need to convert it to a PySpark DataFrame using the spark.createDataFrame function, as below.

df3 = spark.createDataFrame(input_file)

Then pass it with the code below.

df3.createOrReplaceGlobalTempView("df3")
dbutils.notebook.exit("df3")

Meanwhile, you need to swap the roles of NotebookA and NotebookB: call NotebookA as the callee from NotebookB as the caller.
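Putting it together on the caller side, a minimal sketch (Databricks only, so it will not run elsewhere; the "./NotebookB" path and the 120-second timeout are assumptions, and the final toPandas() is only needed if you want a pandas DataFrame back):

```python
# Caller notebook (sketch): run the callee, then look up the global temp
# view whose name it returned via dbutils.notebook.exit.
# "./NotebookB" and the 120-second timeout are assumptions.
returned_view = dbutils.notebook.run("./NotebookB", 120)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
df3_spark = spark.table(global_temp_db + "." + returned_view)
df3 = df3_spark.toPandas()  # convert back to pandas if needed
```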

0
votes

In notebook A, save the df to a CSV file and call notebook B, passing the path to the CSV as an argument. Notebook B reads from the path, performs some operation, and overwrites the CSV. Notebook A then reads from the same path, now with the desired result.

An example:


notebook A (caller)

# write df to /path/test-csv.csv
df = spark.range(5)
df.write.csv(path = '/path/test-csv.csv')
df.show()

# call notebook B with the csv path /path/test-csv.csv
nb = "/path/notebook-b"
dbutils.notebook.run(nb, 60, {'df_path': 'dbfs:/path/test-csv.csv'})

# now that the transformation has completed [err-handling-here], read again from the same path
spark.read.format("csv").load('dbfs:/path/test-csv.csv').show()

output:

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+---+---+
|_c0|_c1|
+---+---+
|  0|0.0|
|  1|2.0|
|  2|4.0|
|  3|6.0|
|  4|8.0|
+---+---+

notebook B (callee)

# create a widget for the csv path argument
dbutils.widgets.text("df_path", '/', 'df-test')
df_path = dbutils.widgets.get("df_path")

# read from path
df = spark.read.format("csv").load(df_path)

# execute whatever operation
df = df.withColumn('2x', df['_c0'] * 2)

# overwrite the transformed dataset at the same path
df.write.csv(path = df_path, mode = "overwrite")

dbutils.notebook.exit(0)
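For what it's worth, the same write/transform/overwrite round trip can be sketched locally with plain pandas and a temp file (the file name and column names here are made up):

```python
import os
import tempfile

import pandas as pd

# "Notebook A": write a dataframe to CSV.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
pd.DataFrame({"id": range(5)}).to_csv(path, index=False)

# "Notebook B": read from the path, transform, overwrite the same path.
df = pd.read_csv(path)
df["2x"] = df["id"] * 2
df.to_csv(path, index=False)

# "Notebook A" again: read the transformed result from the same path.
print(pd.read_csv(path))
```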