Hopefully someone with more experience will have a more polished, battle-tested and ready answer, but here's my shot at it.
You can have your pie and eat part of it too at relatively little cost with GADTs:
{-# LANGUAGE GADTs #-}
data P0 -- phase zero
data P1 -- phase one
data Type p where
TypeInt :: Id -> Type p
TypePointer :: Id -> Type p -> Type p -- target type
TypeArray :: Id -> Type p -> Expr p -> Type p -- elt type, elt count
TypeStruct :: Id -> [(String, Type p)] -> Type p -- [(field name, field type)]
TypeOf :: Id -> Expr P0 -> Type P0
TypeDef :: Id -> Type P0 -> Type P0
data Expr p where
ExprInt :: Int -> Expr p -- literal int
ExprVar :: Var -> Expr p -- variable
ExprSizeof :: Type P0 -> Expr P0
ExprUnop :: Unop -> Expr p -> Expr p
ExprBinop :: Binop -> Expr p -> Expr p -> Expr p
ExprField :: Bool -> Expr P0 -> String -> Expr P0 -- Bool gives true=>s.field, false=>p->field
Here the things we've changed are:
The datatypes now use GADT syntax. This means that constructors are declared using their type signatures. data Foo = Bar Int Char
becomes data Foo where Bar :: Int -> Char -> Foo
(aside from syntax, the two are completely equivalent).
We have added a type variable to both Type
and Expr
. This is a so-called phantom type variable: there is no actual data stored that is of type p
, it is used only to enforce invariants in the type system.
We've declared dummy types to represent the phases before and after the transformation: phase zero and phase one. (In a more elaborate system with multiple phases we could potentially use type-level numbers to denote them.)
GADTs let us store type-level invariants in the data structure. Here we have two of them. The first is that recursive positions must be of the same phase as the structure containing them. For example, looking at TypePointer :: Id -> Type p -> Type p
, you pass a Type p
to the TypePointer
constructor and get a Type p
as result, and those p
s must be the same type. (If we wanted to allow different types, we could use p
and q
.)
The second is that we enforce the fact that some constructors can only be used in the first phase. Most of the constructors are polymorphic in the phantom type variable p
, but some of them require that it be P0
. This means that those constructors can only be used to construct values of type Type P0
or Expr P0
, not any other phase.
GADTs work in two directions. The first is that if you have a function which returns a Type P1
, and try to use one of the constructors which returns a Type P0
to construct it, you will get a type error. This is what's called "correct by construction": it's statically impossible to construct an invalid structure (provided you can encode all of the relevant invariants in the type system). The flip side of it is that if you have a value of Type P1
, you can be sure that it was constructed correctly: the TypeOf
and TypeDef
constructors can't have been used (in fact, the compiler will complain if you try to pattern match on them), and any recursive positions must also be of phase P1
. Essentially, when you construct a GADT you store evidence that the type constraints are satisfied, and when you pattern match on it you retrieve that evidence and can take advantage of it.
That was the easy part. Unfortunately, we have some differences between the two types beyond just which constructors are allowed: some of the constructor arguments are different between the phases, and some are only present in the post-transformation phase. We can again use GADTs to encode this, but it's not as low-cost and elegant. One solution would be to duplicate all of the constructors which are different, and have one each for P0
and P1
. But duplication isn't nice. We can try to do it more fine-grained:
-- a couple of helper types
-- here I take advantage of the fact that of the things only present in one phase,
-- they're always present in P1 and not P0, and not vice versa
data MaybeP p a where
NothingP :: MaybeP P0 a
JustP :: a -> MaybeP P1 a
data EitherP p a b where
LeftP :: a -> EitherP P0 a b
RightP :: b -> EitherP P1 a b
data Type p where
TypeInt :: Id -> Type p
TypePointer :: Id -> Type p -> Type p
TypeArray :: Id -> Type p -> EitherP p (Expr p) Int -> Type p
TypeStruct :: Id -> [(String, MaybeP p Int, Type p)] -> Type p
TypeOf :: Id -> Expr P0 -> Type P0
TypeDef :: Id -> Type P0 -> Type P0
-- for brevity
type MaybeType p = MaybeP p (Type p)
data Expr p where
ExprInt :: MaybeType p -> Int -> Expr p
ExprVar :: MaybeType p -> Var -> Expr p
ExprSizeof :: Type P0 -> Expr P0
ExprUnop :: MaybeType p -> Unop -> Expr p -> Expr p
ExprBinop :: MaybeType p -> Binop -> Expr p -> Expr p -> Expr p
ExprField :: Bool -> Expr P0 -> String -> Expr P0
Here with some helper types we've enforced the fact that some constructor arguments can only be present in phase one (MaybeP
) and some are different between the two phases (EitherP
). While this makes us completely type-safe, it feels a bit ad-hoc and we still have to wrap things in and out of the MaybeP
s and EitherP
s all the time. I don't know if there's a better solution in that respect. Complete type safety is something, though: we could write fromJustP :: MaybeP P1 a -> a
and be sure that it is completely and utterly safe.
Update: An alternative is to use TypeFamilies
:
data Proxy a = Proxy
class Phase p where
type MaybeP p a
type EitherP p a b
maybeP :: Proxy p -> MaybeP p a -> Maybe a
eitherP :: Proxy p -> EitherP p a b -> Either a b
phase :: Proxy p
phase = Proxy
instance Phase P0 where
type MaybeP P0 a = ()
type EitherP P0 a b = a
maybeP _ _ = Nothing
eitherP _ a = Left a
instance Phase P1 where
type MaybeP P1 a = a
type EitherP P1 a b = b
maybeP _ a = Just a
eitherP _ a = Right a
The only change to Expr
and Type
relative the previous version is that the constructors need to have an added Phase p
constraint, e.g. ExprInt :: Phase p => MaybeType p -> Int -> Expr p
.
Here if the type of p
in a Type
or Expr
is known, you can statically know whether the MaybeP
s will be ()
or the given type and which of the types the EitherP
s are, and can use them directly as that type without explicit unwrapping. When p
is unknown you can use maybeP
and eitherP
from the Phase
class to find out what they are. (The Proxy
arguments are necessary, because otherwise the compiler wouldn't have any way to tell which phase you meant.) This is analogous to the GADT version where, if p
is known, you can be sure of what MaybeP
and EitherP
contains, while otherwise you have to pattern match both possibilities. This solution isn't perfect either in the respect that the 'missing' arguments become ()
rather than disappearing entirely.
Constructing Expr
s and Type
s also seems to be broadly similar between the two versions: if the value you're constructing has anything phase-specific about it then it must specify that phase in its type. The trouble seems to come when you want to write a function polymorphic in p
but still handling phase-specific parts. With GADTs this is straightforward:
asdf :: MaybeP p a -> MaybeP p a
asdf NothingP = NothingP
asdf (JustP a) = JustP a
Note that if I had merely written asdf _ = NothingP
the compiler would have complained, because the type of the output wouldn't be guaranteed to be the same as the input. By pattern matching we can figure out what type the input was and return a result of the same type.
With the TypeFamilies
version, though, this is a lot harder. Just using maybeP
and the resulting Maybe
you can't prove anything to the compiler about types. You can get part of the way there by, instead of having maybeP
and eitherP
return Maybe
and Either
, making them deconstructor functions like maybe
and either
which also make a type equality available:
maybeP :: Proxy p -> (p ~ P0 => r) -> (p ~ P1 => a -> r) -> MaybeP p a -> r
eitherP :: Proxy p -> (p ~ P0 => a -> r) -> (p ~ P1 => b -> r) -> EitherP p a b -> r
(Note that we need Rank2Types
for this, and note also that these are essentially the CPS-transformed versions of the GADT versions of MaybeP
and EitherP
.)
Then we can write:
asdf :: Phase p => MaybeP p a -> MaybeP p a
asdf a = maybeP phase () id a
But that's still not enough, because GHC says:
data.hs:116:29:
Could not deduce (MaybeP p a ~ MaybeP p0 a0)
from the context (Phase p)
bound by the type signature for
asdf :: Phase p => MaybeP p a -> MaybeP p a
at data.hs:116:1-29
NB: `MaybeP' is a type function, and may not be injective
In the fourth argument of `maybeP', namely `a'
In the expression: maybeP phase () id a
In an equation for `asdf': asdf a = maybeP phase () id a
Maybe you could solve this with a type signature somewhere, but at that point it seems like more bother than it's worth. So pending further information from someone else I'm going to recommend using the GADT version, being the simpler and more robust one, at the cost of a little syntactic noise.
Update again: The problem here was that because MaybeP p a
is a type function and there is no other information to go by, GHC has no way to know what p
and a
should be. If I pass in a Proxy p
and use that instead of phase
that solves p
, but a
is still unknown.