Reading a zst archive in Scala & Spark: native zStandard library not available

Question

I'm trying to read a zst-compressed file using Spark on Scala.

 import org.apache.spark.sql._
 import org.apache.spark.sql.types._

 val schema = new StructType()
   .add("title", StringType, true)
   .add("selftext", StringType, true)
   .add("score", LongType, true)
   .add("created_utc", LongType, true)
   .add("subreddit", StringType, true)
   .add("author", StringType, true)

 val df_with_schema = spark.read.schema(schema).json("/home/user/repos/concepts/abcde/RS_2019-09.zst")

 df_with_schema.take(1)

Unfortunately this produces the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.0.101 executor driver): java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.

The output of hadoop checknative is shown below. It reports zstd as available, although as far as I understand Apache Spark has its own ZStandardCodec.

Native library checking:
hadoop: true /opt/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
zstd : true /lib/x86_64-linux-gnu/libzstd.so.1
snappy: true /lib/x86_64-linux-gnu/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: false EVP_CIPHER_CTX_cleanup
ISA-L: false libhadoop was built without ISA-L support
PMDK: false The native code was built without PMDK support.

Any ideas are appreciated, thank you!

UPDATE 1: As per this post, I now understand the message better: zstd support is not enabled by default when compiling Hadoop, so one possible solution would be to build Hadoop with that flag enabled.

cnstlungu · Accepted Answer · 2021-04-18T21:25:15

Since I didn't want to build Hadoop myself, I followed the workaround used here and configured Spark to use the Hadoop native libraries:

spark.driver.extraLibraryPath=/opt/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/hadoop/lib/native

I can now read the zst archive into a DataFrame with no issues.
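
To check whether the setting took effect, something like the following can be pasted into spark-shell. This is a minimal sketch, not part of the original answer; it assumes Hadoop's NativeCodeLoader class (from hadoop-common) is on the Spark classpath. As far as I can tell, the error in the question is also raised when the Spark JVM cannot load libhadoop at all, which would explain why hadoop checknative reports zstd as available while Spark does not.

```scala
import org.apache.hadoop.util.NativeCodeLoader

// Library search path the driver JVM was started with.
println(System.getProperty("java.library.path"))

// True only if libhadoop was found and loaded by this JVM.
val nativeLoaded = NativeCodeLoader.isNativeCodeLoaded()
println(s"libhadoop loaded: $nativeLoaded")

// buildSupportsZstd() is itself a native method, so call it only
// once libhadoop is known to be loaded.
if (nativeLoaded) {
  println(s"built with zstd support: ${NativeCodeLoader.buildSupportsZstd()}")
}
```

Note that this only inspects the driver JVM. On a cluster the executors are separate processes, which is why spark.executor.extraLibraryPath is set as well.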
