
I am unable to load a CSV file directly from Azure Blob Storage into an RDD using PySpark in a Jupyter Notebook.

I have read through just about all of the other answers to similar problems, but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook using Pandas, but then I would need to convert the Pandas DataFrame into an RDD afterwards.

My ideal solution would look something like this, but this specific code gives me an error saying it can't infer a schema for CSV.

#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url", source).load()

I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark, but I am having trouble defining the correct path.
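
For context, the approach in that linked answer amounts to registering the SAS token with the WASB connector and reading through a wasbs:// URI instead of the plain HTTPS blob URL. A minimal sketch, assuming hypothetical account, container, and file names, and that the hadoop-azure connector is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blobRead").getOrCreate()

# Hypothetical names; replace with your own storage account, container, and file.
account = "mystorageaccount"
container = "mycontainer"
sas_token = "<SAS token>"

# Register the SAS token for this container with the WASB connector.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (container, account),
    sas_token)

# Read directly from Blob Storage via a wasbs:// path.
path = "wasbs://%s@%s.blob.core.windows.net/elog.csv" % (container, account)
elog = spark.read.format("csv").option("inferSchema", "true").load(path)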

Thank you very much for your help!

Well, what do you get by removing the inferSchema option? - OneCricketeer
It says that it can't infer the schema either way. - Felix Schildorfer
Have you tried manually defining one? - OneCricketeer
Not yet, I was hoping for a more flexible solution. But if that is the only way, I can try it. - Felix Schildorfer
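
Regarding the comment about manually defining a schema: a minimal sketch with made-up column names, assuming a SparkSession named spark and the source path from the question; adjust the fields to the actual CSV layout.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns; replace with the real names and types.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

# With an explicit schema, Spark no longer needs to infer one.
elog = spark.read.format("csv").schema(schema).load(source)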

1 Answer


Here is my sample code that uses Pandas to read a blob URL with a SAS token and then converts the Pandas dataframe to a PySpark one.

First, get a Pandas dataframe object by reading the blob URL.

import pandas as pd

source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)
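
This works because pandas.read_csv accepts an HTTP(S) URL directly; the SAS token in the URL's query string authorizes the request against Blob Storage.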

Then, you can convert it to a PySpark one.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
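
Since the question ultimately asks for an RDD, note that the converted DataFrame exposes its underlying RDD of Row objects directly:

# A DataFrame is backed by an RDD of Row objects; .rdd exposes it.
rdd = spark_df.rdd
print(rdd.take(5))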

Or, you can get the same result with the code below.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
spark_df.show()
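
As an aside, SQLContext is the older entry point; since Spark 2.0, SparkSession (the first variant above) is the recommended API.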

Hope it helps.