Avro ships with a tool called Avro-Tools that can convert between JSON, Avro schema (.avsc), and binary formats. However, I cannot get it to work with circular references.

We have two files:

  1. circular.avsc (generated by Avro)

  2. circular.json (generated by Jackson, because the data contains a circular reference and Avro cannot serialize it).

circular.avsc

{
   "type":"record",
   "name":"Parent",
   "namespace":"bigdata.example.avro",
   "fields":[
      {
         "name":"name",
         "type":[
            "null",
            "string"
         ],
         "default":null
      },
      {
         "name":"child",
         "type":[
            "null",
            {
               "type":"record",
               "name":"Child",
               "fields":[
                  {
                     "name":"name",
                     "type":[
                        "null",
                        "string"
                     ],
                     "default":null
                  },
                  {
                     "name":"parent",
                     "type":[
                        "null",
                        "Parent"
                     ],
                     "default":null
                  }
               ]
            }
         ],
         "default":null
      }
   ]
}

circular.json

{
   "@class":"bigdata.example.avro.Parent",
   "@circle_ref_id":1,
   "name":"parent",
   "child":{
      "@class":"bigdata.example.avro.DerivedChild",
      "@circle_ref_id":2,
      "name":"hello",
      "parent":1
   }
}

Command to run avro-tools on the above

java -jar avro-tools-1.7.6.jar fromjson --schema-file circular.avsc circular.json

Output

2014-06-09 14:29:17.759 java[55860:1607] Unable to load realm mapping info from SCDynamicStore

Objavro.codenullavro.schema? {"type":"record","name":"Parent","namespace":"bigdata.example.avro","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"child","type":["null",{"type":"record","name":"Child","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"parent","type":["null","Parent"],"default":null}]}],"default":null}]}?'???K?jH!??Ė?

Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_STRING

at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)

at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)

at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)

I tried some other JSON values with the same schema, but none of them worked:

JSON 1

{
   "name":"parent",
   "child":{
      "name":"hello",
      "parent":null
   }
}

JSON 2

{
   "name":"parent",
   "child":{
      "name":"hello"
   }
}

JSON 3

{
   "@class":"bigdata.example.avro.Parent",
   "@circle_ref_id":1,
   "name":"parent",
   "child":{
      "@class":"bigdata.example.avro.DerivedChild",
      "@circle_ref_id":2,
      "name":"hello",
      "parent":null
   }
}

Next, I removed the "optional" (nullable union) types from the schema:

circular.avsc

{
   "type":"record",
   "name":"Parent",
   "namespace":"bigdata.example.avro",
   "fields":[
      {
         "name":"name",
         "type":
            "string",
         "default":null
      },
      {
         "name":"child",
         "type":
            {
               "type":"record",
               "name":"Child",
               "fields":[
                  {
                     "name":"name",
                     "type":
                        "string",
                     "default":null
                  },
                  {
                     "name":"parent",
                     "type":
                        "Parent",
                     "default":null
                  }
               ]
            },
         "default":null
      }
   ]
}

circular.json

{
   "@class":"bigdata.example.avro.Parent",
   "@circle_ref_id":1,
   "name":"parent",
   "child":{
      "@class":"bigdata.example.avro.DerivedChild",
      "@circle_ref_id":2,
      "name":"hello",
      "parent":1
   }
}

Output

2014-06-09 15:30:53.716 java[56261:1607] Unable to load realm mapping info from SCDynamicStore

Objavro.codenullavro.schema?{"type":"record","name":"Parent","namespace":"bigdata.example.avro","fields":[{"name":"name","type":"string","default":null},{"name":"child","type":{"type":"record","name":"Child","fields":[{"name":"name","type":"string","default":null},{"name":"parent","type":"Parent","default":null}]},"default":null}]}?x?N??O"?M?`Ab

Exception in thread "main" java.lang.StackOverflowError

at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:212)

at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)

at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)

at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)

at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:216)

at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:323)

Does anyone know how I can make circular references work with Avro?

1 Answer

I ran into the same problem recently and resolved it with a workaround; hopefully it helps.

Based on the Avro specification:

JSON Encoding

Except for unions, the JSON encoding is the same as is used to encode field default values.

The value of a union is encoded in JSON as follows:

  • if its type is null, then it is encoded as a JSON null;
  • otherwise it is encoded as a JSON object with one name/value pair whose name is the type's name and whose value is the recursively encoded value. For Avro's named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.

For example, the union schema ["null","string","Foo"], where Foo is a record name, would encode:

  • null as null;
  • the string "a" as {"string": "a"};
  • and a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.
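
To make the quoted rule concrete, here is a minimal sketch in plain Java (no Avro dependency) of how values would be wrapped under the union encoding. The class and method names (UnionJsonSketch, encodeNullableString, encodeNamedBranch, quote) are hypothetical and only illustrate the shape of the JSON; they are not part of Avro's API.

```java
// Hypothetical helper, not part of Avro: illustrates the union JSON shape only.
class UnionJsonSketch {

    // Encodes a value for a ["null","string"] union.
    static String encodeNullableString(String value) {
        if (value == null) {
            return "null"; // null branch: a bare JSON null
        }
        // any other branch: {"<type name>": <encoded value>}
        return "{\"string\":" + quote(value) + "}";
    }

    // Wraps already-encoded JSON in a named branch such as a record.
    static String encodeNamedBranch(String typeName, String encodedValue) {
        return "{" + quote(typeName) + ":" + encodedValue + "}";
    }

    // Minimal JSON string quoting (handles backslash and double quote only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(encodeNullableString(null));     // null
        System.out.println(encodeNullableString("a"));      // {"string":"a"}
        System.out.println(encodeNamedBranch("Foo", "{}")); // {"Foo":{}}
    }
}
```

Applied to the first schema in the question, the "name" field would need to be written as {"string":"parent"} rather than "parent", which is consistent with the "Expected start-union. Got VALUE_STRING" error.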

If the source file cannot be changed to follow this requirement, the decoding code has to change instead. So I copied the original org.apache.avro.io.JsonDecoder class from the avro-1.7.7 package and customized it as my own class, MyJsonDecoder.

Here is the key part I changed, apart from renaming the class and creating new constructors:

// Additional import needed for findTypeInUnion below: java.util.HashMap
@Override
public int readIndex() throws IOException {
    advance(Symbol.UNION);
    Symbol.Alternative a = (Symbol.Alternative) parser.popSymbol();

    String label;
    if (in.getCurrentToken() == JsonToken.VALUE_NULL) {
        label = "null";
//***********************************************
// Original code, following the "JSON Encoding" section of the Avro spec:
// a non-null union value is a JSON object with one name/value pair whose
// name is the type's name and whose value is the recursively encoded value.
// The source data can't be changed, so this rule is disabled.
//        } else if (in.getCurrentToken() == JsonToken.START_OBJECT &&
//                in.nextToken() == JsonToken.FIELD_NAME) {
//            label = in.getText();
//            in.nextToken();
//            parser.pushSymbol(Symbol.UNION_END);
//***********************************************
        // Customized code:
        // look up the current token's type among the union's branches
        // and, if it is declared there, decode with that branch.
    } else {
        label = findTypeInUnion(in.getCurrentToken(), a);

        // Either the field is missing but not nullable,
        //   or its type is not declared in the union.
        if (label == null) {
            throw error("start-union, type may not be in UNION,");
        }
    }
//***********************************************
// Original code: fail immediately if the value is not wrapped as a union
//        } else {
//                throw error("start-union");
//        }
//***********************************************
    int n = a.findLabel(label);
    if (n < 0)
        throw new AvroTypeException("Unknown union branch " + label);
    parser.pushSymbol(a.getSymbol(n));
    return n;
}

/**
 * Checks whether the current JSON token's type is declared in the union.
 * Does NOT support "record", "enum", or "fixed":
 * these types require a user-defined name in the Avro schema, and if that
 * name can't be found in the JSON file, the value can't be decoded.
 *
 * @param jsonToken         the current JsonToken
 * @param symbolAlternative the union's Symbol.Alternative
 * @return the label of the union branch to decode with, or null if none matches
 */
private String findTypeInUnion(final JsonToken jsonToken,
                               final Symbol.Alternative symbolAlternative) {
    // Create a map for looking up: JsonToken and Avro type
    final HashMap<JsonToken, String> json2Avro = new HashMap<>();

    for (int i = 0; i < symbolAlternative.size(); i++) {
        // Get the type declared in union: symbolAlternative.getLabel(i).
        // Map the JsonToken with Avro type.
        switch (symbolAlternative.getLabel(i)) {
            case "null":
                json2Avro.put(JsonToken.VALUE_NULL, "null");
                break;
            case "boolean":
                json2Avro.put(JsonToken.VALUE_TRUE, "boolean");
                json2Avro.put(JsonToken.VALUE_FALSE, "boolean");
                break;
            case "int":
                json2Avro.put(JsonToken.VALUE_NUMBER_INT, "int");
                break;
            case "long":
                json2Avro.put(JsonToken.VALUE_NUMBER_INT, "long");
                break;
            case "float":
                json2Avro.put(JsonToken.VALUE_NUMBER_FLOAT, "float");
                break;
            case "double":
                json2Avro.put(JsonToken.VALUE_NUMBER_FLOAT, "double");
                break;
            case "bytes":
                json2Avro.put(JsonToken.VALUE_STRING, "bytes");
                break;
            case "string":
                json2Avro.put(JsonToken.VALUE_STRING, "string");
                break;
            case "array":
                json2Avro.put(JsonToken.START_ARRAY, "array");
                break;
            case "map":
                json2Avro.put(JsonToken.START_OBJECT, "map");
                break;
            default: break;
        }
    }

    // Look up the Avro type that corresponds to the current JsonToken
    return json2Avro.get(jsonToken);
}

The general idea is to check whether the type of the value in the source file can be found in the union.
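
The lookup can be tried out in isolation with the stand-alone sketch below. It is not the code from MyJsonDecoder: the Token enum is a hypothetical stand-in for Jackson's JsonToken and the union is passed as a plain list of labels, so that it runs without Avro or Jackson on the classpath.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the lookup idea; Token is a hypothetical stand-in
// for Jackson's JsonToken so the example runs without Avro or Jackson.
class UnionLookupSketch {

    enum Token { VALUE_NULL, VALUE_TRUE, VALUE_FALSE, VALUE_NUMBER_INT,
                 VALUE_NUMBER_FLOAT, VALUE_STRING, START_ARRAY, START_OBJECT }

    // Returns the union label to decode with, or null if no branch matches.
    static String findTypeInUnion(Token token, List<String> unionLabels) {
        Map<Token, String> tokenToLabel = new EnumMap<>(Token.class);
        for (String label : unionLabels) {
            switch (label) {
                case "null":    tokenToLabel.put(Token.VALUE_NULL, label); break;
                case "boolean": tokenToLabel.put(Token.VALUE_TRUE, label);
                                tokenToLabel.put(Token.VALUE_FALSE, label); break;
                case "int":
                case "long":    tokenToLabel.put(Token.VALUE_NUMBER_INT, label); break;
                case "float":
                case "double":  tokenToLabel.put(Token.VALUE_NUMBER_FLOAT, label); break;
                case "bytes":
                case "string":  tokenToLabel.put(Token.VALUE_STRING, label); break;
                case "array":   tokenToLabel.put(Token.START_ARRAY, label); break;
                case "map":     tokenToLabel.put(Token.START_OBJECT, label); break;
                default:        break; // record, enum, fixed: not handled
            }
        }
        return tokenToLabel.get(token);
    }

    public static void main(String[] args) {
        System.out.println(findTypeInUnion(Token.VALUE_STRING, Arrays.asList("null", "string")));
        System.out.println(findTypeInUnion(Token.VALUE_NUMBER_INT, Arrays.asList("int", "long")));
        System.out.println(findTypeInUnion(Token.START_OBJECT, Arrays.asList("null", "Child")));
    }
}
```

Two behaviours are worth noticing. When a union declares two branches that share a JSON token, such as ["int","long"], the later label overwrites the earlier one, so the second call returns "long". A JSON object matched against a union whose only non-null branch is a record returns null, which is the case the decoder reports as an error.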

Some issues remain:

  1. This solution doesn't support the "record", "enum", or "fixed" Avro types, because these types require a user-defined name. For example, with a union such as "type": ["null", {"name": "abc", "type": "record", "fields" : ...}], this code will not work. For primitive types it should work, but please test it before using it in your project.

  2. Personally, I think records should not be nullable: a record is something I need to be sure exists, and if it is missing, I have a bigger problem. If a value can be omitted, I prefer to declare it as a "map" rather than a "record" in the schema.

Hopefully this helps.