
I have successfully installed h2o on my AWS Databricks cluster, and then started the h2o server with:

h2o.init()

When I attempt to import the iris CSV file that is stored in my Databricks DBFS:

train, valid = h2o.import_file(path="/FileStore/tables/iris.csv").split_frame(ratios=[0.7])

I get an H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException.

The CSV file is definitely there; in the same Databricks notebook, I am able to read it directly into a DataFrame and view its contents using the exact same fully qualified path:

df_iris = ks.read_csv("/FileStore/tables/iris.csv")
df_iris.head()

I've also tried calling:

h2o.upload_file("/FileStore/tables/iris.csv")

but to no avail; I get H2OValueError: File /FileStore/tables/iris.csv does not exist. I've also tried uploading the file directly from my local computer (C drive), but that doesn't succeed either.

I've also tried dropping the fully qualified path and specifying just the file name, but I get the same errors. I've read through the H2O documentation and searched the web, but cannot find anyone who has encountered this problem before.

Can someone please help me?

Thanks.


1 Answer


H2O may not understand that this path is on DBFS. Try specifying the path as /dbfs/FileStore/tables/iris.csv - on Databricks, DBFS is FUSE-mounted under /dbfs on the driver, so H2O will read it as an ordinary local file. Alternatively, try the full path with the scheme, like dbfs:/FileStore/tables/iris.csv - but that may require DBFS-specific jars on H2O's classpath.
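As a sketch, here is a small helper (hypothetical, not part of the H2O API) that translates a DBFS-rooted path or dbfs:/ URI into the local FUSE mount path, which you could then pass to h2o.import_file:

```python
def to_fuse_path(dbfs_path: str) -> str:
    """Translate a DBFS URI or DBFS-rooted path to the /dbfs FUSE mount path."""
    if dbfs_path.startswith("dbfs:/"):
        # strip the scheme, then re-root under /dbfs
        return "/dbfs/" + dbfs_path[len("dbfs:/"):].lstrip("/")
    if dbfs_path.startswith("/dbfs/"):
        # already a local FUSE path
        return dbfs_path
    # a bare DBFS path such as /FileStore/tables/iris.csv
    return "/dbfs/" + dbfs_path.lstrip("/")


local_path = to_fuse_path("/FileStore/tables/iris.csv")
# local_path == "/dbfs/FileStore/tables/iris.csv"

# Then, in the notebook (requires a running h2o server):
# import h2o
# h2o.init()
# train, valid = h2o.import_file(path=local_path).split_frame(ratios=[0.7])
```

The helper only rewrites the string; whether H2O can open the result still depends on the /dbfs FUSE mount being available on the node where the H2O server runs.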