I am using Gobblin to periodically extract relational data from Oracle, convert it to Avro, and publish it to HDFS.

My HDFS directory structure looks like this:
-tables
    |
    -t1
       |
       -2016080712345
       |      |
       |      -f1.avro
       |
       -2016070714345
              |
              -f2.avro
I am trying to read from it like so:
val sq = sqlContext.read.format("com.databricks.spark.avro")
.load("/user/username/gobblin/job-output/tables/t1/")
When I run printSchema, I can see that the schema is interpreted correctly. However, when I run count or show, the DataFrame is empty. I have verified that the .avro files are not empty by converting them to JSON:
java -jar avro-tools-1.7.7.jar tojson --pretty t1/20160706230001_append/part.f1.avro > t1.json
I suspect that it may have something to do with the directory structure. Perhaps the Spark Avro libraries only look one level down from the root for .avro files. The logs seem to indicate that only the directories under t1 were listed on the driver:
16/07/07 10:47:09 INFO avro.AvroRelation: Listing hdfs://myhost.mydomain.com:8020/user/username/gobblin/job-output/tables/t1 on driver
16/07/07 10:47:09 INFO avro.AvroRelation: Listing hdfs://myhost.mydomain.com:8020/user/username/gobblin/job-output/tables/t1/20160706230001_append on driver
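To rule out the files themselves, a sanity check I still need to run is loading a single timestamp directory directly (a rough sketch below, with the path taken from the listing above). If that returns rows, it would point at the listing depth rather than the files:

// Sketch: point the reader at one timestamp directory directly.
// If this returns rows, the nested layout (not the files) is the problem.
val one = sqlContext.read.format("com.databricks.spark.avro")
  .load("/user/username/gobblin/job-output/tables/t1/20160706230001_append")
println(one.count())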
Has anyone experienced something similar, or does anyone know how to get around this? I'd hate to have to point lower than the t1 directory because the names are generated by a timestamp.
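For reference, the workaround I'm considering if nothing better turns up is a wildcard over the timestamp directories, on the (unconfirmed) assumption that the reader accepts Hadoop glob patterns in the load path:

// Sketch: glob over the generated timestamp directories so their names
// don't have to be known in advance. Assumes glob patterns are expanded.
val sqAll = sqlContext.read.format("com.databricks.spark.avro")
  .load("/user/username/gobblin/job-output/tables/t1/*")
sqAll.count()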