I am trying to read a CSV file stored in an Azure Storage account. To do that, I installed Spark on my virtual machine and am trying to read the CSV file into a DataFrame from PySpark.

I read somewhere how to do this, followed the steps, and copied the latest hadoop-azure and azure-storage JAR files into my /jar directories. I then got this error:

NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities

I searched for this error and found that I needed to use hadoop-azure-2.8.5.jar instead of the latest hadoop-azure JAR. So I replaced the latest hadoop-azure JAR with hadoop-azure-2.8.5.jar and ran my PySpark code again.

This time I hit a different error:

: java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;

Here is my PySpark code:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
storage_account_name = "<storage_account_name>"
storage_account_access_key = "<storage_account_access_key>"
# Register the account key with the Spark session configuration.
spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)

# Map the wasbs:// scheme to the Azure Blob file system and pass the same key to Hadoop.
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoop_conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoop_conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)


df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv")
df.show()
Could you please check whether your hadoop-azure version matches the Hadoop version that your Spark build uses? - Jim Xu
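One way to run the check suggested in this comment is to compare the version numbers embedded in the hadoop-* JAR file names in Spark's jars directory. The sketch below is only an illustration: the helper names `hadoop_jar_versions` and `mismatched_with_common` are made up for this example, and it assumes the Spark build ships a `hadoop-common-<version>.jar` (some builds bundle Hadoop under other artifact names, in which case a different reference artifact would be needed).

```python
import re

# Matches names such as "hadoop-common-2.7.3.jar" or "hadoop-azure-2.8.5.jar".
_HADOOP_JAR = re.compile(r"^(hadoop-[a-z\-]+?)-(\d+(?:\.\d+)*)\.jar$")


def hadoop_jar_versions(jar_names):
    """Map each hadoop-* artifact name to the version found in its file name."""
    versions = {}
    for name in jar_names:
        match = _HADOOP_JAR.match(name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def mismatched_with_common(versions, reference="hadoop-common"):
    """Return the artifacts whose version differs from the reference artifact's.

    Returns an empty dict when the reference artifact is not present,
    because there is then nothing to compare against.
    """
    expected = versions.get(reference)
    if expected is None:
        return {}
    return {artifact: v for artifact, v in versions.items() if v != expected}


# Hypothetical directory listing, for illustration only.
example = ["hadoop-common-2.7.3.jar", "hadoop-azure-2.8.5.jar", "azure-storage-8.6.0.jar"]
print(mismatched_with_common(hadoop_jar_versions(example)))
```

In practice the list of names would come from the jars directory of the Spark installation, for example via `os.listdir`. With a running session, `spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()` should also report the Hadoop version that Spark loaded.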

1 Answer

I searched for this and tried several hadoop-azure JAR versions. The one that worked for me was hadoop-azure-2.7.0.jar.

With this JAR version, I was able to read the CSV file from Blob storage.
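Instead of copying JAR files by hand, it may also be possible to let Spark download a hadoop-azure release that matches its Hadoop version through the `spark.jars.packages` setting. The following is a minimal sketch under that assumption: `hadoop_azure_coordinate` and `build_session` are hypothetical helper names, the version string has to be supplied by you, and the setting only takes effect if no Spark session is already running.

```python
import re


def hadoop_azure_coordinate(hadoop_version):
    """Build the Maven coordinate for a given hadoop-azure release."""
    if not re.fullmatch(r"\d+(\.\d+)*", hadoop_version):
        raise ValueError("unexpected version string: " + repr(hadoop_version))
    return "org.apache.hadoop:hadoop-azure:" + hadoop_version


def build_session(hadoop_version):
    """Start a session that asks Spark to resolve hadoop-azure from Maven."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", hadoop_azure_coordinate(hadoop_version))
        .getOrCreate()
    )


# 2.7.0 is the version that worked in this answer; yours may differ.
print(hadoop_azure_coordinate("2.7.0"))
```

This needs network access to a Maven repository the first time it runs, and the version passed in should be the Hadoop version your Spark build actually uses.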