Clarifying Data Constructor in Haskell

Question

In the following:

data DataType a = Data a | Datum

I understand that Data Constructor are value level function. What we do above is defining their type. They can be function of multiple arity or const. That's fine. I'm ok with saying Datum construct Datum. What is not that explicit and clear to me here is somehow the difference between the constructor function and what it produce. Please let me know if i am getting it well:

1 - a) Basically writing Data a, is defining both a Data Structure and its Constructor function (as in scala or java usually the class and the constructor have the same name) ?

2 - b) So if i unpack and make an analogy. With Data a We are both defining a Structure(don't want to use class cause class imply a type already i think, but maybe we could) of object (Data Structure), the constructor function (Data Constructor/Value constructor), and the later return an object of that object Structure. Finally The type of that Structure of object is given by the Type constructor. An Object Structure in a sense is just a Tag surrounding a bunch value of some type. Is my understanding correct ?

3 - c) Can I formally Say:

Data Constructor that are Nullary represent constant values -> Return the the constant value itself of which the type is given by the Type Constructor at the definition site.
Data Constructor that takes an argument represent class of values, where class is a Tag ? -> Return an infinite number of object of that class, of which the type is given by the Type constructor at the definition site.

Not sure if I understand you correctly, but I think you're to much on the imperative side. You're not creating instances. Everything in Haskell is constant. You're not "creating" things; you're defining them. If you type "Datum", that's not returning "Datum"; that's the wrong way to think about it. It's just a value, similar to e.g. the number "3" in C. Similarly, you're not "constructing" a Data object if you're giving it an (potentially complex) argument; you're defining the Data object together with its argument (which might be a complex definition, but nevertheless a definition). — Nearoo
Data objects without an argument could be considered to be not fully elaborated Data definitions. — Nearoo
Or rather, Data applied to a value of type Foo is a value of type DataType Foo. It doesn't return it so much as it is it. — chepner
Data 'c', for example, is just a value of type DataType Char. It is entirely distinct from 'c' :: Char, and Data is as much a part of the value as is c. In that sense, Data is much more than "just" a function. — chepner

Jon Purdy Jon Purdy · Accepted Answer · 2021-01-06T21:36:18

Another way of writing this:

data DataType a = Data a | Datum

Is with generalised algebraic data type (GADT) syntax, using the GADTSyntax extension, which lets us specify the types of the constructors explicitly:

{-# LANGUAGE GADTSyntax #-}

data DataType a where
  Data  :: a -> DataType a
  Datum ::      DataType a

(The GADTs extension would work too; it would also allow us to specify constructors with different type arguments in the result, like DataType Int vs. DataType Bool, but that’s a more advanced topic, and we don’t need that functionality here.)

These are exactly the types you would see in GHCi if you asked for the types of the constructor functions with :type / :t:

> :{
| data DataType a where
|   Data  :: a -> DataType a
|   Datum ::      DataType a
| :}

> :type Data
Data :: a -> DataType a

> :t Datum
Datum :: DataType a

With ExplicitForAll we can also specify the scope of the type variables explicitly, and make it clearer that the a in the data definition is a separate variable from the a in the constructor definitions by also giving them different names:

data DataType a where
  Data  :: forall b. b -> DataType b
  Datum :: forall c.      DataType c

Some more examples of this notation with standard prelude types:

data Either a b where
  Left  :: forall a b. a -> Either a b
  Right :: forall a b. b -> Either a b

data Maybe a where
  Nothing :: Maybe a
  Just    :: a -> Maybe a

data Bool where
  False :: Bool
  True  :: Bool

data Ordering where
  LT, EQ, GT :: Ordering  -- Shorthand for repeated ‘:: Ordering’

I understand that Data Constructor are value level function. What we do above is defining their type. They can be function of multiple arity or const. That's fine. I'm ok with saying Datum construct Datum. What is not that explicit and clear to me here is somehow the difference between the constructor function and what it produce.

Datum and Data are both “constructors” of DataType a values; neither Datum nor Data is a type! These are just “tags” that select between the possible varieties of a DataType a value.

What is produced is always a value of type DataType a for a given a; the constructor selects which “shape” it takes.

A rough analogue of this is a union in languages like C or C++, plus an enumeration for the “tag”. In pseudocode:

enum Tag {
  DataTag,
  DatumTag,
}

// A single anonymous field.
struct DataFields<A> {
  A field1;  
}

// No fields.
struct DatumFields<A> {};

// A union of the possible field types.
union Fields<A> {
  DataFields<A>  data;
  DatumFields<A> datum;
}

// A pair of a tag with the fields for that tag.
struct DataType<A> {
  Tag       tag;
  Fields<A> fields;
}

The constructors are then just functions returning a value with the appropriate tag and fields. Pseudocode:

<A> DataType<A> newData(A x) {
  DataType<A> result;
  result.tag = DataTag;
  result.fields.data.field1 = x;
  return result;
}

<A> DataType<A> newDatum() {
  DataType<A> result;
  result.tag = DatumTag;
  // No fields.
  return result;
}

Unions are unsafe, since the tag and fields can get out of sync, but sum types are safe because they couple these together.

A pattern-match like this in Haskell:

case someDT of
  Datum  -> f
  Data x -> g x

Is a combination of testing the tag and extracting the fields. Again, in pseudocode:

if (someDT.tag == DatumTag) {
  f();
} else if (someDT.tag == DataTag) {
  var x = someDT.fields.data.field1;
  g(x);
}

Again this is coupled in Haskell to ensure that you can only ever access the fields if you have checked the tag by pattern-matching.

So, in answer to your questions:

1 - a) Basically writing Data a, is defining both a Data Structure and its Constructor function (as in scala or java usually the class and the constructor have the same name) ?

Data a in your original code is not defining a data structure, in that Data is not a separate type from DataType a, it’s just one of the possible tags that a DataType a value may have. Internally, a value of type DataType Int is one of the following:

The tag for Data (in GHC, a pointer to an “info table” for the constructor), and a reference to a value of type Int.

x = Data (1 :: Int) :: DataType Int

       +----------+----------------+     +---------+----------------+
x ---->| Data tag | pointer to Int |---->| Int tag | unboxed Int# 1 |
       +----------+----------------+     +---------+----------------+

The tag for Datum, and no other fields.

y = Datum :: DataType Int

        +-----------+
y ----> | Datum tag |
        +-----------+

In a language with unions, the size of a union is the maximum of all its alternatives, since the type must support representing any of the alternatives with mutation. In Haskell, since values are immutable, they don’t require any extra “padding” since they can’t be changed.

It’s a similar situation for standard data types, e.g., a product or sum type:

(x :: X, y :: Y) :: (X, Y)
  +---------+--------------+--------------+
  | (,) tag | pointer to X | pointer to Y |
  +---------+--------------+--------------+

Left (m :: M) :: Either M N
  +-----------+--------------+
  | Left tag  | pointer to M |
  +-----------+--------------+

Right (n :: N) :: Either M N
  +-----------+--------------+
  | Right tag | pointer to N |
  +-----------+--------------+

2 - b) So if i unpack and make an analogy. With Data a We are both defining a Structure(don't want to use class cause class imply a type already i think, but maybe we could) of object (Data Structure), the constructor function (Data Constructor/Value constructor), and the later return an object of that object Structure. Finally The type of that Structure of object is given by the Type constructor. An Object Structure in a sense is just a Tag surrounding a bunch value of some type. Is my understanding correct ?

This is sort of correct, but again, the constructors Data and Datum aren’t “data structures” by themselves. They’re just the names used to introduce (construct) and eliminate (match) values of type DataType a, for some type a that is chosen by the caller of the constructors to fill in the forall

data DataType a = Data a | Datum says:

If some term e has type T, then the term Data e has type DataType T
Inversely, if some value of type DataType T matches the pattern Data x, then x has type T in the scope of the match (case branch or function equation)
The term Datum has type DataType T for any type T

3 - c) Can I formally Say:

Data Constructor that are Nullary represent constant values -> Return the the constant value itself of which the type is given by the Type Constructor at the definition site.

Data Constructor that takes an argument represent class of values, where class is a Tag ? -> Return an infinite number of object of that class, of which the type is given by the Type constructor at the definition site.

Not exactly. A type constructor like DataType :: Type -> Type, Maybe :: Type -> Type, or Either :: Type -> Type -> Type, or [] :: Type -> Type (list), or a polymorphic data type, represents an “infinite” family of concrete types (Maybe Int, Maybe Char, Maybe (String -> String), …) but only in the same way that id :: forall a. a -> a represents an “infinite” family of functions (id :: Int -> Int, id :: Char -> Char, id :: String -> String, …).

That is, the type a here is a parameter filled in with an argument value given by the caller. Usually this is implicit, through type inference, but you can specify it explicitly with the TypeApplications extension:

-- Akin to: \ (a :: Type) -> \ (x :: a) -> x
id        :: forall a. a   -> a
id x = x

id @Int   ::           Int -> Int
id @Int 1 ::                  Int

Data           :: forall a. a    -> DataType a
Data @Char     ::           Char -> DataType Char
Data @Char 'x' ::                   DataType Char

The data constructors of each instantiation don’t really have anything to do with each other. There’s nothing in common between the instantiations Data :: Int -> DataType Int and Data :: Char -> DataType Char, apart from the fact that they share the same tag name.

Another way of thinking about this in Java terms is with the visitor pattern. DataType would be represented as a function that accepts a “DataType visitor”, and then the constructors don’t correspond to separate data types, they’re just the methods of the visitor which accept the fields and return some result. Writing the equivalent code in Java is a worthwhile exercise, but here it is in Haskell:

{-# LANGUAGE RankNTypes #-}
-- (Allows passing polymorphic functions as arguments.)

type DataType a
  = forall r.    -- A visitor with a generic result type
  r              -- With one “method” for the ‘Datum’ case (no fields)
  -> (a -> r)    -- And one for the ‘Data’ case (one field)
  -> r           -- Returning the result

newData :: a -> DataType a
newData field = \ _visitDatum visitData -> visitData field

newDatum :: DataType a
newDatum = \ visitDatum _visitData -> visitDatum

Pattern-matching is simply running the visitor:

matchDT :: DataType a -> b -> (a -> b) -> b
matchDT dt visitDatum visitData = dt visitDatum visitData
-- Or: matchDT dt = dt
-- Or: matchDT = id

-- case someDT of { Datum -> f; Data x -> g x }
-- f :: r
-- g :: a -> r
-- someDT :: DataType a
--        :: forall r. r -> (a -> r) -> r

someDT f (\ x -> g x)

Similarly, in Haskell, data constructors are just the ways of introducing and eliminating values of a user-defined type.

Clarifying Data Constructor in Haskell

2 Answers