
I am trying to convert a Parquet file to an Excel file, but whether I use the pandas openpyxl or xlsxwriter engine, it fails with an "Operation not supported" error. Reading an Excel file with the openpyxl engine works fine in Databricks, however.

Reading is not the problem; the code below works:

import pandas as pd

xlfile = '/dbfs/mnt/raw/BOMFILE.xlsx'
tmp_csv = '/dbfs/mnt/trusted/BOMFILE.csv'
pdf = pd.read_excel(xlfile, engine='openpyxl')
pdf.to_csv(tmp_csv, index=None, header=True)

However, when I try to write with openpyxl as well as xlsxwriter, it fails:

parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()
pandas_df.to_excel(final, engine='openpyxl')
#pandas_df.to_excel(final, engine='xlsxwriter')#, sheet_name=tbl)

The error I get:

FileCreateError: [Errno 95] Operation not supported

OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
FileCreateError                           Traceback (most recent call last)
<command-473603709964454> in <module>
     17       final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
     18       print(outfile)
---> 19       pandas_df.to_excel(outfile, engine='openpyxl')
     20       #pandas_df.to_excel(outfile, engine='xlsxwriter')#, sheet_name=tbl)

/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
   2179             startcol=startcol,
   2180             freeze_panes=freeze_panes,
-> 2181             engine=engine,
   2182         )
   2183 

Please suggest.

1 Answer


The problem is that the local file API support in DBFS (the /dbfs FUSE mount) has limitations; in particular, it doesn't support the random writes that are required to create Excel files. From the documentation:

Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs.

In your case it could look like this:

from shutil import copyfile

parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
temp_file = '/tmp/PRODUCT.xlsx'

df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()

# Write the Excel file to local disk first, where random writes are supported...
pandas_df.to_excel(temp_file, engine='openpyxl')

# ...then copy the finished file to DBFS
copyfile(temp_file, final)

P.S. You can also use dbutils.fs.cp to copy the file (see the docs) - it will also work on Community Edition, where the /dbfs FUSE mount isn't supported.
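
For completeness, a minimal sketch of that variant, reusing the paths from above; note that dbutils.fs.cp takes URI-style paths (file:/ for local disk, dbfs:/ for DBFS):

temp_file = '/tmp/PRODUCT.xlsx'
pandas_df.to_excel(temp_file, engine='openpyxl')

# dbutils is available in Databricks notebooks without an import;
# copy from local disk (file:/) to DBFS (dbfs:/)
dbutils.fs.cp('file:' + temp_file, 'dbfs:/mnt/trusted/PRODUCT.xlsx')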