
I'm trying to create an external table in Hive and another in BigQuery over the same data, stored in Google Cloud Storage in Avro format and written with Spark.

I'm using a Dataproc cluster with Spark 2.2.0, spark-avro 4.0.0 and Hive 2.1.1.
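The files are written with spark-avro more or less like this (a simplified sketch, not the exact job; the DataFrame and the GCS path are placeholders):

// Simplified sketch: writing a DataFrame as Avro with spark-avro 4.0.0 on Spark 2.2.0.
// `df` and the bucket path are placeholders.
df.write
  .format("com.databricks.spark.avro")
  .mode("overwrite")
  .save("gs://my-bucket/my-table/")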

There are some differences between Avro versions/packages, but if I create the table using Hive and then write the files using Spark, I'm able to see them in Hive.

BigQuery, however, is different: it is able to read the Hive Avro files but NOT the Spark Avro files.

Error:

The Apache Avro library failed to parse the header with the following error: Invalid namespace: .someField

Searching a little about the error, the problem seems to be that the Avro files written by Spark are different from the Avro files Hive and BigQuery expect.

I don't know exactly how to fix this, maybe by using a different Avro package in Spark, but I haven't found one that is compatible with all these systems.

I would also like to avoid tricky workarounds like creating a temporary table in Hive and then populating a second one with insert into ... select * from .... I'll be writing a lot of data, and I would like to avoid that kind of solution.

Any help would be appreciated. Thanks

Comments:

The error is "Invalid namespace: .someField". Is ".someField" the correct fullname? avro.apache.org/docs/current/spec.html#names – Xiaoxia Lin

It's another name, but it's exactly the name of one of the fields. In fact, it's the name of an array-of-struct field. There seem to be some differences in the schema definition between Avro versions. – Javier Montón

3 Answers

1 vote

The error message is thrown by the C++ Avro library, which BigQuery uses. Hive probably uses the Java Avro library. The C++ library doesn't allow a namespace that starts with ".".

This is the code from the library:

if (!ns_.empty() && (ns_[0] == '.' || ns_[ns_.size() - 1] == '.' ||
                     std::find_if(ns_.begin(), ns_.end(), invalidChar1) != ns_.end())) {
  throw Exception("Invalid namespace: " + ns_);
}
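To confirm what Spark actually wrote, you can print the schema embedded in one of the files with the Java Avro library and look at the namespace values (the local path below is just a placeholder):

// Sketch: dump the writer schema of an Avro file produced by Spark and
// inspect the "namespace" entries (e.g. ".someField"). The path is a placeholder.
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

val reader = new DataFileReader[GenericRecord](
  new File("/tmp/part-00000.avro"),
  new GenericDatumReader[GenericRecord]())
println(reader.getSchema.toString(true))
reader.close()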
1 vote

Spark-avro has an additional option, recordNamespace, to set the root record namespace, so that nested record namespaces will not start with a dot.

https://github.com/databricks/spark-avro/blob/branch-4.0/README-for-old-spark-versions.md
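For example, something along these lines (the record name, namespace and path are just illustrations, pick whatever fits your data):

// Sketch: set recordName/recordNamespace when writing, so nested record
// namespaces are rooted under a real namespace instead of starting with ".".
// The option values and the path are placeholders.
df.write
  .format("com.databricks.spark.avro")
  .option("recordName", "MyTable")
  .option("recordNamespace", "com.example.avro")
  .save("gs://my-bucket/my-table/")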

0 votes

Wondering if you ever found an answer to this.

I am seeing the same thing while trying to load data into a BigQuery table. The library first loads the data into GCS in Avro format. The schema has an array of structs as well, and the namespace begins with a ".".