0
votes

What's the difference between ResourceSchema and Schema in Pig? A Schema class is already provided, so why does Pig add another Schema-like class, ResourceSchema, for storage functions? Its API is almost the same as Schema's: you set the name and type of each ResourceFieldSchema, and a field can also have a child ResourceSchema.

1
I'm not sure. In my opinion, ResourceSchema may be used to hide the internal structure of Schema. - zsxwing

1 Answer

0
votes

The API docs back up @zsxwing's comment:

  • Schema - The Schema class encapsulates the notion of a schema for a relational operator. A schema is a list of columns that describe the output of a relational operator.

    Each column in the relation is represented as a FieldSchema, a static class inside the Schema. A column by definition has an alias, a type and a possible schema (if the column is a bag or a tuple).

    In addition, each column in the schema has a unique auto generated name used for tracking the lineage of the column in a sequence of statements. The lineage of the column is tracked using a map of the predecessors' columns to the operators that generate the predecessor columns.

    The predecessor columns are the columns required in order to generate the column under consideration. Similarly, a reverse lookup of operators that generate the predecessor column to the predecessor column is maintained.

  • ResourceSchema - A representation of a schema used to communicate with load and store functions. This is separate from Schema, which is an internal Pig representation of a schema.

So one of the main differences I can see from the API docs is that a Schema is able to track the input columns required to build each of its columns, whereas a ResourceSchema is just the schema definition: field name, type, and an optional sub-schema.
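To make that distinction concrete, here is a rough, self-contained sketch. Note that PlainField, TrackedColumn and SchemaSketch are hypothetical stand-ins I made up for illustration; they are not Pig's classes and do not mirror Pig's actual fields or method names. The only point is the shape of the two ideas: an internal column carries an auto-generated name plus lineage, while the version handed to a load/store function carries only name, type and an optional sub-schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a ResourceSchema-style field: plain data only.
final class PlainField {
    final String name;
    final String type;
    final List<PlainField> children; // empty unless the field is a bag or tuple

    PlainField(String name, String type, List<PlainField> children) {
        this.name = name;
        this.type = type;
        this.children = children;
    }
}

// Hypothetical stand-in for an internal Schema-style column: the same
// description plus bookkeeping that only the planner cares about.
final class TrackedColumn {
    private static int counter = 0;

    final String alias;
    final String type;
    final String canonicalName; // auto-generated, unique per column
    // predecessor column's canonical name -> operator that generated it
    final Map<String, String> predecessors = new HashMap<>();

    TrackedColumn(String alias, String type) {
        this.alias = alias;
        this.type = type;
        this.canonicalName = "col_" + (counter++);
    }

    // Record that this column was built from the given predecessor column.
    void derivedFrom(TrackedColumn predecessor, String operator) {
        predecessors.put(predecessor.canonicalName, operator);
    }

    // What crosses the boundary to a load/store function: lineage is dropped.
    PlainField toPlainField() {
        return new PlainField(alias, type, new ArrayList<>());
    }
}

class SchemaSketch {
    public static void main(String[] args) {
        TrackedColumn price = new TrackedColumn("price", "double");
        TrackedColumn total = new TrackedColumn("total", "double");
        total.derivedFrom(price, "FOREACH");

        PlainField exported = total.toPlainField();
        System.out.println(exported.name + ":" + exported.type);
        System.out.println("predecessors tracked internally: " + total.predecessors.size());
    }
}
```

The exported PlainField has no way to say where "total" came from, which is roughly why a storage function never needs to know about the planner's internal bookkeeping.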