
I'm using Databricks with Python on Azure to process my data. The result of this process will be saved as a CSV file on Azure Blob Storage.

But here's the problem: when the result file is larger than 750 MB, an error occurs.

After some research on Google, I learned that I have to increase spark.rpc.message.maxSize, and I did that. The problem is that the maximum value I can set is only 2 GB, and since I'm using Databricks to analyze big data, I expect files much larger than 2 GB.
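
For reference, this is roughly how that setting is applied (a minimal sketch: the value is given in MB, 2047 is the largest value Spark accepts, and on Databricks the key normally goes into the cluster's Spark config because it has to be set before the session starts):

# sketch only: spark.rpc.message.maxSize must be in place before the SparkSession exists,
# so on Databricks it is usually set in the cluster's Spark config rather than in a notebook
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.rpc.message.maxSize", "2047")   # value in MB; 2047 is the hard cap
         .getOrCreate())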

My questions are:

  1. Is 2 GB really the maximum message size supported on Azure Databricks? I tried to search through the official Microsoft documentation but could not find any information about this.

  2. Is there any way for me to increase this value, or even have it scale depending on my data?

Here is my Python code for this process.

#mount Azure Blob Storage to DBFS
#(the wasbs URI takes the form wasbs://<container>@<storage-account>.blob.core.windows.net)
dbutils.fs.mount(
  source = "wasbs://<container>@mystoragecontainer.blob.core.windows.net",
  mount_point = "/mnt/test3",
  extra_configs = {"fs.azure.account.key.mystoragecontainer.blob.core.windows.net": dbutils.secrets.get(scope = "myapps", key = "myappskey")})


#define the saving process in a function:
#write the dataframe as a single CSV part file, then move/rename it so the
#result ends up at `savefile` instead of inside a Spark output directory
def save_data(df, savefile):
  df.coalesce(1).write.mode("overwrite").options(header="true").format("com.databricks.spark.csv").save(savefile)
  res = savefile.split('/')
  # parent directory of savefile (rstrip() strips characters, not a suffix, so slice instead)
  ls_target = savefile[:-len(res[-1])]
  fileList = dbutils.fs.ls(savefile + "/")
  target_name = ""
  for item in fileList:
    if item.name.endswith("csv"):
      # move the single part-xxxxx.csv file up into the parent directory
      filename = item.path
      target_parts = filename.split('/')
      target_name = filename.replace('/' + target_parts[-2] + '/', '/')
      print(target_name)
      dbutils.fs.mv(filename, ls_target)
    else:
      # drop Spark bookkeeping files (_SUCCESS, _committed_*, ...)
      dbutils.fs.rm(item.path, True)
  # remove the now-empty output directory, then rename the CSV to the requested path
  dbutils.fs.rm(savefile, True)
  dbutils.fs.mv(target_name, savefile)

# call my save function
save_data(df,"dbfs:/mnt/test3/myfolderpath/japanese2.csv")

Any information would be appreciated.

Best regards,

What have you tried so far? There are limitations on DBFS with Databricks Runtime 5.5 and below, which only supports files smaller than 2 GB. If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. That being said, you should be able to write much larger files in Python to a mounted storage. Can you share some code you tried? - Axel R.
Hi @AxelR, thank you for your response. I have already mounted my Azure storage, but the same problem persists. I've added my code to my first post, please check it for reference. - oRoberto

1 Answer


If I understand correctly, you want to merge the distributed CSV files generated by:

df.coalesce(1).write.mode("overwrite").options(header="true").format("com.databricks.spark.csv").save(savefile) 

I would suggest converting it to a pandas DataFrame and writing a single CSV, like below:

# write a single CSV directly via pandas (instead of the save_data function above)
df.toPandas().to_csv("/dbfs/mnt/test3/myfolderpath/japanese2.csv")

This should write a single CSV containing all the data in your DataFrame. Be careful to use the /dbfs/ prefix when using pandas, as it goes through the local file API instead of the DBFS API.
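
For illustration, here is the same mounted data seen through both APIs (just a sketch reusing the paths from your question):

# DBFS API (Spark, dbutils) addresses the mount with dbfs:/ paths
display(dbutils.fs.ls("dbfs:/mnt/test3/myfolderpath/"))

# local file API (pandas, open(), os) sees the same files under /dbfs/
import pandas as pd
pdf = pd.read_csv("/dbfs/mnt/test3/myfolderpath/japanese2.csv")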

Also, note that this is PySpark, not really Scala.