I am trying to load Avro files from HDFS. I have around 1000 part Avro files in a directory, and I am using this to read them:

val df = sqlContext.read.format("com.databricks.spark.avro").load("path/to/avro/dir")
df.select("QUERY").take(50).foreach(println)

It works if I pass only 1 or 2 Avro files in the path, but if I pass a directory with 400+ files I get the error below. Each Avro file is around 300 MB.

org.apache.avro.AvroRuntimeException: java.io.IOException: Filesystem closed
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.advanceNextRecord(AvroRelation.scala:157)
    at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.hasNext(AvroRelation.scala:166)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:413)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
    at scala.collection.AbstractIterator.to(Iterator.scala:1194)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.avro.mapred.FsInput.read(FsInput.java:46)
    at org.apache.avro.file.DataFileReader$SeekableInputStream.read(DataFileReader.java:210)
    at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.tryReadRaw(BinaryDecoder.java:839)
    at org.apache.avro.io.BinaryDecoder.isEnd(BinaryDecoder.java:444)
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:261)
    ... 36 more

1 Answer

It is probably because one of the Avro files in the directory was not written correctly; maybe the process that wrote the files stopped abruptly. Deleting the files that were not written correctly should solve the problem.
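
If you want to find the bad file(s) before deleting anything, a minimal sketch like the one below can help. It is not from the original post: it assumes the same placeholder directory path as the question, that the driver can see the cluster's Hadoop configuration, and that the part files end in ".avro" (adjust the filter if not). It uses the Hadoop FileSystem API plus Avro's DataFileStream to read each part file end to end and prints the files that fail partway through.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val fs = FileSystem.get(new Configuration())
val dir = new Path("path/to/avro/dir")  // same placeholder path as in the question

fs.listStatus(dir)
  .filter(_.getPath.getName.endsWith(".avro"))  // assumption: part files are named *.avro
  .foreach { status =>
    val in = fs.open(status.getPath)
    try {
      val reader = new DataFileStream[GenericRecord](in, new GenericDatumReader[GenericRecord]())
      // Read every record; a truncated or badly written file usually throws here.
      while (reader.hasNext) reader.next()
      reader.close()
      println(s"OK:      ${status.getPath}")
    } catch {
      case e: Exception =>
        println(s"BROKEN:  ${status.getPath} (${e.getMessage})")
    } finally {
      in.close()
    }
  }

Run it in the same spark-shell session (or a small standalone job); once it has flagged the unreadable files, delete or rewrite them and the directory-level load should work again.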