
I am unable to load a CSV file directly from Azure Blob Storage into an RDD using PySpark in a Jupyter Notebook.

I have read through just about all of the other answers to similar problems, but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook using Pandas, but then I would need to convert the Pandas DataFrame into an RDD afterwards.

My ideal solution would look something like this, but this specific code gives me an error saying it can't infer a schema for CSV.

#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url", source).load()

I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark, but I am having trouble defining the correct path.
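
For context, the approach in that linked answer amounts to registering the SAS token with the WASB connector and reading through a wasbs:// URI instead of the plain HTTPS blob URL. A minimal sketch, assuming hypothetical account, container, and file names, and that the hadoop-azure connector is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blobRead").getOrCreate()

# Hypothetical names; replace with your own storage account, container, and file.
account = "mystorageaccount"
container = "mycontainer"
sas_token = "<SAS token>"

# Register the SAS token for this container with the WASB connector.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (container, account),
    sas_token)

# Read directly from Blob Storage via a wasbs:// path.
path = "wasbs://%s@%s.blob.core.windows.net/elog.csv" % (container, account)
elog = spark.read.format("csv").option("inferSchema", "true").load(path)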

Thank you very much for your help!

Well, what do you get by removing the inferSchema option? - OneCricketeer
It says that it can't infer the schema either way. - Felix Schildorfer
Have you tried manually defining one? - OneCricketeer
Not yet, I was hoping for a more flexible solution. But if that is the only way, I can try it. - Felix Schildorfer
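
Regarding the comment about manually defining a schema: a minimal sketch with made-up column names, assuming a SparkSession named spark and the source path from the question; adjust the fields to the actual CSV layout.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns; replace with the real names and types.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

# With an explicit schema, Spark no longer needs to infer one.
elog = spark.read.format("csv").schema(schema).load(source)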

1 Answer


Here is my sample code that uses Pandas to read a blob URL with a SAS token and then converts the Pandas dataframe to a PySpark one.

First, get a Pandas dataframe object by reading the blob URL.

import pandas as pd

source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)
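
This works because pandas.read_csv accepts an HTTP(S) URL directly; the SAS token in the URL's query string authorizes the request against Blob Storage.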

Then, you can convert it to a PySpark one.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
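
Since the question ultimately asks for an RDD, note that the converted DataFrame exposes its underlying RDD of Row objects directly:

# A DataFrame is backed by an RDD of Row objects; .rdd exposes it.
rdd = spark_df.rdd
print(rdd.take(5))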

Or, you can get the same result with the code below.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
spark_df.show()
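
As an aside, SQLContext is the older entry point; since Spark 2.0, SparkSession (the first variant above) is the recommended API.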

Hope it helps.