0
votes

I am trying to read the csv file from pyspark, while reading it is throwing

py4j.protocol.Py4JJavaError: An error occurred while calling o30.csv. : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326) at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593) ... 25 more

Process finished with exit code 1

Error, So could anyone please suggest me where am i doing wrong in below code.

from pyspark.sql import SparkSession

SECRET_ACCESS_KEY = "XXXXXXXXXXX"
STORAGE_NAME = "azuresvkstorageaccount11123"
CONTAINER = "inputstorage1"
FILE_NAME = "movies.csv"


spark = SparkSession.builder.appName("Azure_PySpark_Connectivity")\
                    .master("local[*]")\
                    .getOrCreate()

fs_acc_key = "fs.azure.account.key." + STORAGE_NAME + ".blob.core.windows.net"
spark.conf.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(fs_acc_key, SECRET_ACCESS_KEY)

file_path = "wasb://[email protected]/movies.csv"
print(file_path)

Df = spark.read.csv(path=file_path,header=True,inferSchema=True) #Error Coming from this line it is unable to read the csv file

#Df.show(20,True)
1

1 Answers

0
votes

I have figured out the problem, it is coming from maven jars, the solution is

  1. Download hadoop-azure and azure-storage jars from the maven portal manually and copy these jars to spark/jars/. folder.
  2. Do the same for the jetty-utils jar, add this jar to spark/jars/. folder
  3. Then refresh and run the script again, it works perfectly.