
I am using Gobblin to periodically extract relational data from Oracle, convert it to Avro, and publish it to HDFS.

My HDFS directory structure looks like this:

-tables
  |
  -t1
   |
   -2016080712345
    |
    -f1.avro
   |
   -2016070714345
    |
    -f2.avro

I am trying to read from it like so:

val sq = sqlContext.read.format("com.databricks.spark.avro")
  .load("/user/username/gobblin/job-output/tables/t1/")

When I run printSchema I can see that the schema is interpreted correctly.

However, when I run count or show, the DataFrame is empty. I have verified that the .avro files are not empty by converting them to JSON:

java -jar avro-tools-1.7.7.jar  tojson --pretty t1/20160706230001_append/part.f1.avro > t1.json

I suspect that it may have something to do with the directory structure. Perhaps the spark-avro library only looks one level down from the root for .avro files. The logs seem to indicate that only the directories under t1 were listed on the driver:

16/07/07 10:47:09 INFO avro.AvroRelation: Listing hdfs://myhost.mydomain.com:8020/user/username/gobblin/job-output/tables/t1 on driver

16/07/07 10:47:09 INFO avro.AvroRelation: Listing hdfs://myhost.mydomain.com:8020/user/username/gobblin/job-output/tables/t1/20160706230001_append on driver

Has anyone experienced something similar, or know how to get around this? I'd hate to have to point lower than the t1 directory, because the directory names are generated from a timestamp.


1 Answer


I'm experiencing the same problem. While I don't know the exact cause, there is a way to work around it:

Instead of pointing at the parent directory, use a wildcard and point at the .avro file level:

val df = sqlContext.read.format("com.databricks.spark.avro")
  .load("/path/to/tables/t1/*/*.avro")