0 votes

I am getting the following errors while running Spark on Hadoop. It also gives me the same error when I use the Scala API, so I am convinced the error is related to the Spark paths and the CLASSPATH.

The error is:

stage 0.0 failed 4 times; aborting job
17/04/25 13:36:53 WARN TaskSetManager: Lost task 11.2 in stage 0.0 (TID 20, 10.98.92.150, executor 4): TaskKilled (killed intentionally)
17/04/25 13:36:53 WARN TaskSetManager: Lost task 13.1 in stage 0.0 (TID 21, 10.98.92.150, executor 4): TaskKilled (killed intentionally)
17/04/25 13:36:53 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, 10.98.92.150, executor 5): TaskKilled (killed intentionally)
17/04/25 13:36:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.98.92.150, executor 5): TaskKilled (killed intentionally)
Traceback (most recent call last):
  File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/examples/src/main/python/test.py", line 8, in
    hive_context.sql("select count(1) from src_tmp").show()
  File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 318, in show
  File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
  File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/apache/spark/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.98.92.151, executor 2): java.lang.NoClassDefFoundError: Could not initialize class org.apache.parquet.CorruptStatistics
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:346)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:417)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:351)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
    at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.parquet.CorruptStatistics
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:346)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:417)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:351)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Please copy-paste only the relevant trace information. A wall of trace like the one above will not encourage anyone to help you. - Pushkr
The trace itself is not that bad; if only it was formatted correctly... - Samson Scharfrichter
I'm sorry about the formatting, I am new to the forum. - xyz
Looks like your edit retained lots of useless stuff, but dropped the useful error message (the one that I tried to copy in my answer -- and which appears to have been truncated from the beginning, on second thought) - Samson Scharfrichter

1 Answer

4 votes

EDIT: The original answer was actually wrong. Sure, the "classpath first" property is useful in some cases, but not in this one. Thanks for the votes, but they are not deserved. :-(

java.lang.NoSuchMethodError: org.apache.parquet.SemanticVersion.(IIILjava/lang/String;Ljava/lang/String;Ljava/lang/String;)V
at org.apache.parquet.CorruptStatistics

You are using Spark 2.1.0, which bundles Parquet V1.8.1; that version defines its classes in package org.apache.parquet.


Original answer (which missed the point)

Your Hadoop distro seems to be from Cloudera, and (for example) CDH 5.10 bundles Parquet V1.5.0, which defines its classes in package parquet.

unzip -l $SPARK_HOME/jars/parquet-column-1.8.1.jar | grep CorruptStatistics.class
     3507  07-17-2015 13:56   org/apache/parquet/CorruptStatistics.class

unzip -l /opt/cloudera/parcels/CDH/jars/parquet-common-1.5.0-cdh5.10.0.jar | grep SemanticVersion.class
     5406  01-20-2017 11:57   parquet/SemanticVersion.class

So these versions are clearly incompatible.

EDIT: So incompatible that they do not even interfere with each other, simply because the package names are different.

When you run Spark executors under YARN, by default, the CLASSPATH contains the JARs from both versions of Parquet in random order, with catastrophic results.
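If you want to see the exact CLASSPATH a given executor was actually started with, one low-level way is to look at the launch_container.sh script that YARN generates for each container. This is only a sketch -- the /yarn/nm path is an assumption, check yarn.nodemanager.local-dirs on your nodes:

# On a NodeManager host: locate a recent launch_container.sh and print the
# CLASSPATH it exports for the Spark executor (the /yarn/nm path is a guess).
find /yarn/nm/usercache -name launch_container.sh 2>/dev/null \
  | head -1 \
  | xargs grep 'export CLASSPATH'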

Workaround: make sure your Spark JARs have precedence in the CLASSPATH with either

  • a command-line option on each execution (see the spark-submit sketch after this list)
    --conf spark.yarn.user.classpath.first=true
  • or a global entry in spark-defaults.conf
    spark.yarn.user.classpath.first true
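For instance, a full submission using the command-line option might look like this. This is only a sketch: the master and deploy-mode settings are assumptions, and the script path is simply the one visible in your trace.

# Hypothetical spark-submit invocation -- only the --conf flag is the point here.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.user.classpath.first=true \
  /opt/apache/spark/spark-2.1.0-bin-hadoop2.7/examples/src/main/python/test.py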


Better analysis, but no real solution at this point (sorry)

The "NoSuchMethodError" complains that it could not find, at run-time, a method that was present at compile time.
That's for class SemanticVersion and a method with no name -- which is clearly wrong, even a constructor should be be marked with .<init> or sthg similar -- so I assume the error message got truncated, maybe because of < character being swallowed by S.O. message editor when you pasted it.

The method details: arguments (int, int, int, String, String, String) and a void return type. See that post for reference.
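If you want to check which constructors a given copy of SemanticVersion really exposes at run-time, something along these lines should do it. This assumes (as on a stock Spark 2.1.0 install) that the class sits in parquet-common-1.8.1.jar; adjust the JAR name for other copies.

# Extract the class from a candidate JAR and dump all of its members with javap,
# then keep only the constructor lines.
cd /tmp
unzip -o $SPARK_HOME/jars/parquet-common-1.8.1.jar org/apache/parquet/SemanticVersion.class
javap -p org/apache/parquet/SemanticVersion.class | grep 'SemanticVersion('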

OK, let's assume class CorruptStatistics was compiled with a call to new SemanticVersion(1, 2, 3, "a", "b", "c") that was valid at compile time, but for some reason the SemanticVersion class found at run-time does not have that constructor (release mismatch?!).

That's insane, because the "official" source code (cf. the Apache Git repo under "parquet-column" and "parquet-common") shows no trace of such a constructor, never, ever. Actually, CorruptStatistics is a bug fix for compatibility with some buggy Parquet formats, and SemanticVersion has just two constructors, neither of which takes a String.
Some "non-official" (but easier to read) source code for V1.8.1 can be found here and here.

Bottom line: all that makes no sense, unless

  • Spark 2.1.0 ships with Parquet JARs that are somehow inconsistent (and nobody has found that bug yet!?!)
  • or you built a custom JAR, embedding Parquet classes that you have customised -- or a rogue fork of Parquet invoked in a rogue POM (??)
  • or you have deployed an exotic Cloudera parcel that places custom Parquet JARs in the CLASSPATH (but the "user classpath first" trick should have fixed that - unless you have these exotic JARs explicitly in spark.executor.extraClassPath)

To solve that mystery, I strongly suggest that you inspect all JARs that might be present at run-time on the YARN CLASSPATH, including your custom JARs + Spark JARs + Cloudera CDH JARs + Cloudera extra-parcel JARs, searching for any occurrence of CorruptStatistics.class -- you have the example unzip -l | grep command for that; wrap it in a loop, and be ready for surprises.
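A rough sketch of such a loop follows; the first two directories are taken from the paths above, and the third is a placeholder for wherever your own JARs live.

# Scan the candidate JARs for the Parquet classes involved in the error and print
# the JAR name plus the package each class lives in: a bare "parquet/..." entry
# points at the old V1.5 classes, "org/apache/parquet/..." at the V1.8.1 ones.
for jar in $SPARK_HOME/jars/*.jar \
           /opt/cloudera/parcels/CDH/jars/*.jar \
           /path/to/your/custom/jars/*.jar ; do
  hits=$(unzip -l "$jar" 2>/dev/null | grep -E '(CorruptStatistics|SemanticVersion)\.class')
  if [ -n "$hits" ]; then
    echo "== $jar"
    echo "$hits"
  fi
done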