
I am building a pipeline that receives messages from Azure Event Hubs and saves them into Databricks Delta tables.

All my tests with static data went well; see the code below:

from pyspark.sql import SparkSession

body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"'
tableLocation = "/delta/tables/myTableName"

spark = SparkSession.builder.appName("CSV converter").getOrCreate()    
csvData = spark.sparkContext.parallelize(body.split('\n'))

df = spark.read \
.option("header", True) \
.option("delimiter","|") \
.option("quote", "\"") \
.option("nullValue", "\\N") \
.option("inferSchema", "true") \
.option("mergeSchema", "true") \
.csv(csvData)

df.write.format("delta").mode("append").save(tableLocation)

However, in my case each Event Hub message is a CSV string, and the messages may come from many sources. Each message must therefore be processed separately, because each one may end up saved in a different Delta table.
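For illustration, a single message body in this pipe-delimited, quoted format can be parsed with Python's standard `csv` module (a minimal sketch using the sample payload from the static test above; the column names come from the header row):

```python
import csv
import io

# Sample message body in the same pipe-delimited, quoted format as above
body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"'

# Parse the CSV string into dicts keyed by the header row
reader = csv.DictReader(io.StringIO(body), delimiter='|', quotechar='"')
rows = list(reader)
print(rows[0]['A'])  # → False
```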

When I try to execute this same code inside a foreach statement, it doesn't work: no errors appear in the logs, and I can't find any saved table.

So maybe I am doing something wrong when calling the foreach. See the code below:

def SaveData(row):
    # ... the same code as above

dfEventHubCSV.rdd.foreach(SaveData)

I tried doing this in a streaming context, but I sadly ran into the same problem.

What is it about the foreach that makes it behave differently?

Below is the full code I am running:

import pyspark.sql.types as t
from pyspark.sql import SparkSession, SQLContext

# row contains the fields Body and SdIds
# Body: CSV string
# SdIds: a string ID
def SaveData(row):

  # Data that determines the destination table for each row
  rowInfo = GetDestinationTableData(row['SdIds']).collect()  
  table = rowInfo[0][4]
  schema = rowInfo[0][3]
  database = rowInfo[0][2]     
  body = row['Body']

  tableLocation = "/delta/" + database + '/' + schema + '/' + table
  checkpointLocation = "/delta/" + database + '/' + schema + "/_checkpoints/" + table

  spark = SparkSession.builder.appName("CSV").getOrCreate()    
  csvData = spark.sparkContext.parallelize(body.split('\n'))

  df = spark.read \
  .option("header", True) \
  .option("delimiter","|") \
  .option("quote", "\"") \
  .option("nullValue", "\\N") \
  .option("inferSchema", "true") \
  .option("mergeSchema", "true") \
  .csv(csvData)

  df.write.format("delta").mode("append").save(tableLocation)

dfEventHubCSV.rdd.foreach(SaveData)
Hi Flavio, it would be very helpful if you can post the exact implementation – abiratsis

@AlexandrosBiratsis the full code is too large to post here, but I am going to edit the post and give you all the code of this function – Flavio Pegas

1 Answer


Well, in the end, as always, it was something very simple, but I didn't see this documented anywhere.

Basically, when you perform a foreach, the dataframe you want to save is built inside the loop. The worker, unlike the driver, won't automatically resolve the "/dbfs/" path when saving, so if you don't manually add the "/dbfs/" prefix, the data is saved locally on the worker and you will never find it.

That is why my loops weren't working.
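In other words, the fix was to prefix the save path with "/dbfs/" when it is built inside the foreach. A minimal sketch of the path construction (the database, schema, and table values below are placeholders, not the actual ones from the pipeline):

```python
# Hypothetical destination values resolved per message
database, schema, table = "mydb", "dbo", "myTableName"

# On the driver, "/delta/..." resolved fine; inside the foreach on a
# worker, the DBFS mount had to be referenced explicitly via "/dbfs/"
tableLocation = "/dbfs/delta/" + database + "/" + schema + "/" + table
print(tableLocation)  # → /dbfs/delta/mydb/dbo/myTableName
```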