What's the difference between ResourceSchema and Schema in Pig? Pig already provides a Schema class, so why does it add another Schema-like class, ResourceSchema, for storage functions? Its API is almost the same as Schema's: you set each ResourceFieldSchema's name and type, and a field can also have a child ResourceSchema.
1 Answer
The API docs back up @zsxwing's comment:
Schema- The Schema class encapsulates the notion of a schema for a relational operator. A schema is a list of columns that describe the output of a relational operator. Each column in the relation is represented as a FieldSchema, a static class inside the Schema. A column by definition has an alias, a type and a possible schema (if the column is a bag or a tuple).
In addition, each column in the schema has a unique auto generated name used for tracking the lineage of the column in a sequence of statements. The lineage of the column is tracked using a map of the predecessors' columns to the operators that generate the predecessor columns.
The predecessor columns are the columns required in order to generate the column under consideration. Similarly, a reverse lookup of operators that generate the predecessor column to the predecessor column is maintained.
ResourceSchema- A representation of a schema used to communicate with load and store functions. This is separate from Schema, which is an internal Pig representation of a schema.
So one of the main differences I can see from the API docs is that a Schema can track the input columns required to build each of its columns, whereas a ResourceSchema is just the schema definition: each field's name, type, and optional sub-schema.
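To make the distinction concrete, here is a minimal toy model in plain Java. It does not use Pig's classes: `PlainField`, `TrackedField`, `toPlain`, and the `col_N` naming scheme are all hypothetical names invented for this sketch, standing in loosely for the roles of ResourceSchema's fields and Schema's FieldSchema as the docs describe them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT Pig's classes, just an illustration of the two roles.
class SchemaSketch {

    // Storage-facing description: name, type, optional nested fields. Nothing else.
    static final class PlainField {
        final String name;
        final String type;
        final List<PlainField> children; // empty unless the field is a bag or tuple

        PlainField(String name, String type, List<PlainField> children) {
            this.name = name;
            this.type = type;
            this.children = children;
        }
    }

    // Internal description: the same three parts plus lineage bookkeeping.
    static final class TrackedField {
        private static int counter = 0; // source of unique generated names (assumed scheme)

        final String alias;
        final String type;
        final List<TrackedField> children;
        final String canonicalName; // unique auto-generated name
        // Predecessor column's generated name -> operator that produced it.
        final Map<String, String> predecessors = new LinkedHashMap<>();

        TrackedField(String alias, String type, List<TrackedField> children) {
            this.alias = alias;
            this.type = type;
            this.children = children;
            this.canonicalName = "col_" + (counter++);
        }
    }

    // Converting to the storage-facing form simply drops the lineage information.
    static PlainField toPlain(TrackedField field) {
        List<PlainField> kids = new ArrayList<>();
        for (TrackedField child : field.children) {
            kids.add(toPlain(child));
        }
        return new PlainField(field.alias, field.type, kids);
    }

    public static void main(String[] args) {
        TrackedField a = new TrackedField("a", "int", Collections.emptyList());
        TrackedField b = new TrackedField("b", "int", Collections.emptyList());

        // "total" is derived from a and b, so it records both as predecessors.
        TrackedField total = new TrackedField("total", "int", Collections.emptyList());
        total.predecessors.put(a.canonicalName, "LOAD");
        total.predecessors.put(b.canonicalName, "LOAD");

        PlainField plain = toPlain(total);
        System.out.println(plain.name + ":" + plain.type);
    }
}
```

In this sketch a load or store function would only ever see `PlainField`, which is enough to describe the data on disk, while the lineage map stays inside the planner-side type. That separation is one plausible reading of why the docs call ResourceSchema "separate from Schema".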