18 votes

I'm trying to write a compiler for a C-like language in Haskell. The compiler progresses by transforming an AST. The first pass parses the input to create the AST, tying a knot with the symbol table to allow symbols to be located before they have been defined without the need for forward references.
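
Here is a minimal, self-contained sketch of what I mean by tying the knot (illustrative only, with Data.Map standing in for the real symbol table):

import qualified Data.Map as M

type SymbolTable = M.Map String Int

-- each "declaration" may consult the *final* table while that table
-- is still being built; laziness makes this work, provided no
-- symbol's value strictly depends on itself
buildTable :: [(String, SymbolTable -> Int)] -> SymbolTable
buildTable decls = table
  where
    table = M.fromList [(name, mk table) | (name, mk) <- decls]

-- "a" refers forward to "b", which is defined later:
example :: SymbolTable
example = buildTable [("a", \t -> t M.! "b" + 1), ("b", const 41)]
-- example M.! "a" == 42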

The AST contains information about types and expressions, and there can be connections between them; e.g. sizeof(T) is an expression that depends on a type T, and T[e] is an array type that depends on a constant expression e.

Types and expressions are represented by Haskell data types like so:

data Type = TypeInt Id
          | TypePointer Id Type -- target type
          | TypeArray Id Type Expr -- elt type, elt count
          | TypeStruct Id [(String, Type)] -- [(field name, field type)]
          | TypeOf Id Expr
          | TypeDef Id Type

data Expr = ExprInt Int -- literal int
          | ExprVar Var -- variable
          | ExprSizeof Type
          | ExprUnop Unop Expr
          | ExprBinop Binop Expr Expr
          | ExprField Bool Expr String -- Bool gives true=>s.field, false=>p->field

Here Unop includes operators like address-of (&) and dereference (*), and Binop includes operators like plus (+) and times (*), and so on.

Note that each type is assigned a unique Id. This is used to construct a type dependency graph in order to detect cycles which lead to infinite types. Once we are sure that there are no cycles in the type graph, it is safe to apply recursive functions over them without getting into an infinite loop.
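
For illustration, here is one way such a check might look (a sketch only, assuming Id has Ord and Show instances; typeEdges deliberately stops at TypePointer, since recursion through a pointer is fine, and the TypeOf case is elided):

import Data.Graph (SCC (..), stronglyConnComp)

typeId :: Type -> Id
typeId (TypeInt i)       = i
typeId (TypePointer i _) = i
typeId (TypeArray i _ _) = i
typeId (TypeStruct i _)  = i
typeId (TypeOf i _)      = i
typeId (TypeDef i _)     = i

-- one level of Id edges; we never recurse into the (possibly cyclic)
-- structure itself, only record which Ids each type mentions
typeEdges :: Type -> [Id]
typeEdges (TypeInt _)       = []
typeEdges (TypePointer _ _) = []                        -- pointers break cycles
typeEdges (TypeArray _ t _) = [typeId t]
typeEdges (TypeStruct _ fs) = [typeId t | (_, t) <- fs]
typeEdges (TypeOf _ _)      = []                        -- would follow the Expr's type
typeEdges (TypeDef _ t)     = [typeId t]

checkCycles :: [Type] -> Either String ()
checkCycles ts =
    case [c | CyclicSCC c <- stronglyConnComp nodes] of
        []      -> Right ()
        (c : _) -> Left ("infinite type: cycle through " ++ show (map typeId c))
  where
    nodes = [(t, typeId t, typeEdges t) | t <- ts]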

The next step is to determine the size of each type, assign offsets to struct fields, and replace ExprFields with pointer arithmetic. In doing so, we can determine the type of expressions, and eliminate ExprSizeofs, ExprFields, TypeDefs, and TypeOfs from the type graph, so our types and expressions have evolved, and now look more like this:

data Type' = TypeInt Id
           | TypePointer Id Type'
           | TypeArray Id Type' Int -- constant expression has been eval'd
           | TypeStruct Id [(String, Int, Type')] -- field offset has been determined

data Expr' = ExprInt Type' Int
           | ExprVar Type' Var
           | ExprUnop Type' Unop Expr'
           | ExprBinop Type' Binop Expr' Expr'

Note that we've eliminated some of the data constructors, and slightly changed some of the others. In particular, Type' no longer contains any Expr', and every Expr' has determined its Type'.

So, finally, the question: Is it better to create two almost identical sets of data types, or try to unify them into a single data type?

Keeping two separate data types makes it explicit that certain constructors can no longer appear. However, the function which performs constant folding to evaluate constant expressions will have type:

foldConstants :: Expr -> Either String Expr'

But this means that we cannot perform constant folding later on with Expr' values (imagine some pass that manipulates an Expr' and wants to fold any constant expressions that emerge). We would need a second implementation:

foldConstants' :: Expr' -> Either String Expr'

On the other hand, keeping a single type would solve the constant folding problem, but would prevent the type checker from enforcing static invariants.

Furthermore, what do we put into the unknown fields (like field offsets, array sizes, and expression types) during the first pass? We could plug the holes with undefined, or error "*hole*", but that feels like a disaster waiting to happen (like NULL pointers that you can't even check). We could change the unknown fields into Maybes, and plug the holes with Nothing (like NULL pointers that we can check), but it would be annoying in subsequent passes to have to keep pulling values out of Maybes that are always going to be Justs.
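
For instance, with a hypothetical annotation wrapper, every pass after type checking would pay this tax:

-- hypothetical single-type variant with a Maybe-typed annotation
data AnnExpr = AnnExpr (Maybe Type) Expr

annType :: AnnExpr -> Type
annType (AnnExpr (Just t) _) = t
annType (AnnExpr Nothing  _) = error "impossible: annotation missing"
-- ... we "know" the Nothing case can't happen, but can't prove it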

Comments:

I have the exact same problem. It sounds like a great idea to have a unified structure; however, even if you did, could you really reuse the same functions for both phases? It seems to me that the tiny structural changes result in larger leaps of logic. – Karolis Juodelė

Yes, although if every function that operates on these types runs in a monad which gives you access to a context and the ability to generate errors, it should be possible to transform the old constructs into the newer ones on the fly. – pat

2 Answers

16 votes

Hopefully someone with more experience will have a more polished, battle-tested and ready answer, but here's my shot at it.

You can have your pie and eat part of it too at relatively little cost with GADTs:

{-# LANGUAGE GADTs #-}

data P0 -- phase zero
data P1 -- phase one

data Type p where
     TypeInt     :: Id -> Type p
     TypePointer :: Id -> Type p -> Type p             -- target type
     TypeArray   :: Id -> Type p -> Expr p -> Type p   -- elt type, elt count
     TypeStruct  :: Id -> [(String, Type p)] -> Type p -- [(field name, field type)]
     TypeOf      :: Id -> Expr P0 -> Type P0
     TypeDef     :: Id -> Type P0 -> Type P0

data Expr p where
     ExprInt     :: Int -> Expr p                        -- literal int
     ExprVar     :: Var -> Expr p                        -- variable
     ExprSizeof  :: Type P0 -> Expr P0
     ExprUnop    :: Unop -> Expr p -> Expr p
     ExprBinop   :: Binop -> Expr p -> Expr p -> Expr p
     ExprField   :: Bool -> Expr P0 -> String -> Expr P0 -- Bool gives true=>s.field, false=>p->field

Here the things we've changed are:

  • The datatypes now use GADT syntax. This means that constructors are declared using their type signatures. data Foo = Bar Int Char becomes data Foo where Bar :: Int -> Char -> Foo (aside from syntax, the two are completely equivalent).

  • We have added a type variable to both Type and Expr. This is a so-called phantom type variable: there is no actual data stored that is of type p, it is used only to enforce invariants in the type system.

  • We've declared dummy types to represent the phases before and after the transformation: phase zero and phase one. (In a more elaborate system with multiple phases we could potentially use type-level numbers to denote them.)

  • GADTs let us store type-level invariants in the data structure. Here we have two of them. The first is that recursive positions must be of the same phase as the structure containing them. For example, looking at TypePointer :: Id -> Type p -> Type p, you pass a Type p to the TypePointer constructor and get a Type p as result, and those ps must be the same type. (If we wanted to allow different types, we could use p and q.)

  • The second is that we enforce the fact that some constructors can only be used in the first phase. Most of the constructors are polymorphic in the phantom type variable p, but some of them require that it be P0. This means that those constructors can only be used to construct values of type Type P0 or Expr P0, not any other phase.

GADTs work in two directions. The first is that if you have a function which returns a Type P1, and try to use one of the constructors which returns a Type P0 to construct it, you will get a type error. This is what's called "correct by construction": it's statically impossible to construct an invalid structure (provided you can encode all of the relevant invariants in the type system). The flip side of it is that if you have a value of Type P1, you can be sure that it was constructed correctly: the TypeOf and TypeDef constructors can't have been used (in fact, the compiler will complain if you try to pattern match on them), and any recursive positions must also be of phase P1. Essentially, when you construct a GADT you store evidence that the type constraints are satisfied, and when you pattern match on it you retrieve that evidence and can take advantage of it.
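
For example, a traversal over phase-one types needs no equations for the eliminated constructors, and GHC will verify that none are missing:

-- total despite covering only four constructors: TypeOf and TypeDef
-- produce Type P0, so they can never hide inside a Type P1
typeId :: Type P1 -> Id
typeId (TypeInt i)       = i
typeId (TypePointer i _) = i
typeId (TypeArray i _ _) = i
typeId (TypeStruct i _)  = i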

That was the easy part. Unfortunately, we have some differences between the two types beyond just which constructors are allowed: some of the constructor arguments differ between the phases, and some are only present in the post-transformation phase. We can again use GADTs to encode this, but it's not as low-cost and elegant. One solution would be to duplicate all of the constructors that differ, with one version each for P0 and P1. But duplication isn't nice. We can try to be more fine-grained:

-- a couple of helper types
-- here I take advantage of the fact that of the things only present in one phase,
-- they're always present in P1 and not P0, and not vice versa
data MaybeP p a where
     NothingP :: MaybeP P0 a
     JustP    :: a -> MaybeP P1 a

data EitherP p a b where
     LeftP  :: a -> EitherP P0 a b
     RightP :: b -> EitherP P1 a b

data Type p where
     TypeInt     :: Id -> Type p
     TypePointer :: Id -> Type p -> Type p
     TypeArray   :: Id -> Type p -> EitherP p (Expr p) Int -> Type p
     TypeStruct  :: Id -> [(String, MaybeP p Int, Type p)] -> Type p
     TypeOf      :: Id -> Expr P0 -> Type P0
     TypeDef     :: Id -> Type P0 -> Type P0

-- for brevity
type MaybeType p = MaybeP p (Type p)

data Expr p where
     ExprInt     :: MaybeType p -> Int -> Expr p
     ExprVar     :: MaybeType p -> Var -> Expr p
     ExprSizeof  :: Type P0 -> Expr P0
     ExprUnop    :: MaybeType p -> Unop -> Expr p -> Expr p
     ExprBinop   :: MaybeType p -> Binop -> Expr p -> Expr p -> Expr p
     ExprField   :: Bool -> Expr P0 -> String -> Expr P0

Here with some helper types we've enforced the fact that some constructor arguments can only be present in phase one (MaybeP) and some are different between the two phases (EitherP). While this makes us completely type-safe, it feels a bit ad-hoc and we still have to wrap things in and out of the MaybePs and EitherPs all the time. I don't know if there's a better solution in that respect. Complete type safety is something, though: we could write fromJustP :: MaybeP P1 a -> a and be sure that it is completely and utterly safe.
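
Indeed, that function is a one-liner, and GHC knows the missing case is unreachable:

fromJustP :: MaybeP P1 a -> a
fromJustP (JustP a) = a
-- no NothingP equation needed: NothingP :: MaybeP P0 a can never
-- inhabit MaybeP P1 a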

Update: An alternative is to use TypeFamilies:

data Proxy a = Proxy

class Phase p where
    type MaybeP  p a
    type EitherP p a b
    maybeP  :: Proxy p -> MaybeP p a -> Maybe a
    eitherP :: Proxy p -> EitherP p a b -> Either a b
    phase   :: Proxy p
    phase = Proxy

instance Phase P0 where
    type MaybeP  P0 a   = ()
    type EitherP P0 a b = a
    maybeP  _ _ = Nothing
    eitherP _ a = Left a

instance Phase P1 where
    type MaybeP  P1 a   = a
    type EitherP P1 a b = b
    maybeP  _ a = Just  a
    eitherP _ a = Right a

The only change to Expr and Type relative to the previous version is that the constructors need an added Phase p constraint, e.g. ExprInt :: Phase p => MaybeType p -> Int -> Expr p.

Here, if the type of p in a Type or Expr is known, you statically know whether each MaybeP will be () or the given type, and which of the two types each EitherP is, and can use them directly as that type without explicit unwrapping. When p is unknown, you can use maybeP and eitherP from the Phase class to find out what they are. (The Proxy arguments are necessary, because otherwise the compiler wouldn't have any way to tell which phase you meant.) This is analogous to the GADT version: if p is known, you can be sure of what MaybeP and EitherP contain, while otherwise you have to pattern match on both possibilities. This solution isn't perfect either, in that the 'missing' arguments become () rather than disappearing entirely.
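
For example, with p fixed to P1 the type annotation is just there, with no wrapper to unpack (the P0-only constructors can't occur, as before):

-- MaybeType P1 = MaybeP P1 (Type P1) reduces to plain Type P1
exprType :: Expr P1 -> Type P1
exprType (ExprInt   t _)     = t
exprType (ExprVar   t _)     = t
exprType (ExprUnop  t _ _)   = t
exprType (ExprBinop t _ _ _) = t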

Constructing Exprs and Types also seems to be broadly similar between the two versions: if the value you're constructing has anything phase-specific about it then it must specify that phase in its type. The trouble seems to come when you want to write a function polymorphic in p but still handling phase-specific parts. With GADTs this is straightforward:

asdf :: MaybeP p a -> MaybeP p a
asdf NothingP  = NothingP
asdf (JustP a) = JustP a

Note that if I had merely written asdf _ = NothingP the compiler would have complained, because the type of the output wouldn't be guaranteed to be the same as the input. By pattern matching we can figure out what type the input was and return a result of the same type.

With the TypeFamilies version, though, this is a lot harder. Using maybeP and the resulting Maybe alone, you can't prove anything to the compiler about types. You can get part of the way there by making maybeP and eitherP, instead of returning Maybe and Either, into deconstructor functions in the style of maybe and either which also make a type equality available:

maybeP  :: Proxy p -> (p ~ P0 => r) -> (p ~ P1 => a -> r) -> MaybeP p a -> r
eitherP :: Proxy p -> (p ~ P0 => a -> r) -> (p ~ P1 => b -> r) -> EitherP p a b -> r

(Note that we need Rank2Types for this, and note also that these are essentially the CPS-transformed versions of the GADT versions of MaybeP and EitherP.)
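
With those signatures, the instances become the following (a sketch; each instance simply ignores the continuation for the impossible phase):

instance Phase P0 where
    type MaybeP  P0 a   = ()
    type EitherP P0 a b = a
    maybeP  _ n _ _ = n
    eitherP _ l _ a = l a

instance Phase P1 where
    type MaybeP  P1 a   = a
    type EitherP P1 a b = b
    maybeP  _ _ j a = j a
    eitherP _ _ r b = r b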

Then we can write:

asdf :: Phase p => MaybeP p a -> MaybeP p a
asdf a = maybeP phase () id a

But that's still not enough, because GHC says:

data.hs:116:29:
 Could not deduce (MaybeP p a ~ MaybeP p0 a0)
 from the context (Phase p)
   bound by the type signature for
              asdf :: Phase p => MaybeP p a -> MaybeP p a
   at data.hs:116:1-29
 NB: `MaybeP' is a type function, and may not be injective
 In the fourth argument of `maybeP', namely `a'
 In the expression: maybeP phase () id a
 In an equation for `asdf': asdf a = maybeP phase () id a

Maybe you could solve this with a type signature somewhere, but at that point it seems like more bother than it's worth. So, pending further information from someone else, I'm going to recommend the GADT version as the simpler and more robust one, at the cost of a little syntactic noise.

Update again: The problem here was that because MaybeP p a is a type function and there is no other information to go by, GHC has no way to know what p and a should be. Passing in a Proxy p and using it instead of phase pins down p, but a is still unknown.

4 votes

There's no ideal solution to this problem, since each one has different pros and cons.

I would personally go with using a single data type "tree" and add separate data constructors for things that need to be differentiated. I.e.:

data Type
  = ...
  | TypeArray Id Type Expr
  | TypeResolvedArray Id Type Int
  | ...

This has the advantage that you can run the same phase multiple times on the same tree, as you say, but the reasoning goes deeper than that. Say you are implementing a syntax element that generates more AST (something like include, or a C++-template kind of deal), and it can depend on constant expressions, like your TypeArray does, so you can't evaluate it in the first iteration. With the unified data type approach you can just insert the new AST into your existing tree, and not only can you run the same phases as before on that tree directly, you also get caching for free: if the new AST references an array by using sizeof(typeof(myarr)) or something, you don't have to determine the constant size of myarr again, because its type is already a TypeResolvedArray from your previous resolution phase.
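
A sketch of what such a resolution pass might look like (evalConstExpr is a hypothetical helper that evaluates constant expressions); the TypeResolvedArray case is what makes re-running the pass cheap:

resolveType :: Type -> Either String Type
resolveType (TypeArray i t e) = do
    t' <- resolveType t
    n  <- evalConstExpr e              -- fails if e is not constant
    return (TypeResolvedArray i t' n)
resolveType (TypeResolvedArray i t n) = do
    t' <- resolveType t                -- size already cached, keep it
    return (TypeResolvedArray i t' n)
resolveType t = return t               -- other cases elided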

You could use a different representation when you are past all of your compilation phases, and it is time to interpret the code (or something); then you are certain of the fact that no more AST changes will be necessary, and a more streamlined representation might be a good idea.

By the way, you should use Data.Word.Word instead of Data.Int.Int for array sizes. It is such a common error in C to use int for array sizes and indices, when the corresponding C types (like size_t) are actually unsigned. Please don't make the same mistake in your language, unless you really want to support arrays with negative sizes.