Parsing custom infix operators + implementation with FParsec

Question

I'm a little stuck on the way "real parsers", such as F# or Haskell, do to parse custom operators. For a "normal" language, we would simply define an AST node at which there would be predefined operator possibilities, for example: +, -, *, ==, >=, +=, ... etc.

But I wonder how to do it in a functional language that allow you to create custom operators, let's take OCaml as an example, quite close to F# (language of my implementation), and is quite well known.

Thus, each operator is a function and has a type, as well as a definition, and we can create our own operators:

val (+) : 'a -> 'a -> 'a
let (+) x y = x + y

val (|>) : 'a -> ('a -> 'b) -> 'b
let (|>) x f = f x

So I wonder how it works with parsing to make it work.

1) How does the parser know that we want to use a custom operator? If we use a function that takes another function in the first argument and another element in the second, how does it know that we call a function rather than using an infix operator?

let example x =
    // Do we add up, or do we call the function "takeOpAndOther"?
    takeOpAndOther + x

2) To answer this question, I thought of a way to do this in F#, thanks to FParsec. The first solution that came to mind was to simply use OperatorPrecedenceParser. The concern is that this means only works for predefined operators (or if there is a way to do what I want with it, I don't know how).

Then I thought about creating a simple parser:

open FParsec

type Expression =
    | Number of int
    | InfixF of Expression * string * Expression
    | DataName of string
    | FunctionCall of string * Expression list

let ws = skipMany (pchar ' ' <|> pchar '\t') <?> ""
let ws1 = skipMany1 (pchar ' ' <|> pchar '\t') <?> ""

let identifier = many1Satisfy (fun c -> isLetter c || isDigit c)

let allowedSymbols =
   [ '!'; '@'; '#'; '$'; '%'; '^'; '&';
     '§'; '*'; '°'; '.'; '~'; ':'; '-';
     '+'; '='; '?'; '/'; '>'; '<'; '|'; ]

let customOperatorIdentifier = many1SatisfyL (fun c -> allowedSymbols |> List.contains c) "valid custom operator"

// I call about this parser
let rec infixF () = parse {
        let! lvalue = ws >>? expression
        let! op = ws >>? customOperatorIdentifier
        let! rvalue = ws >>? expression
        return InfixF(lvalue, op, rvalue)
    }

and number = pint32 |>> Number

and dataName = identifier |>> DataName

and functionCall () = parse {
        let! id = ws >>? identifier
        let! parameters = sepEndBy1 (ws >>? expression) ws1
        return FunctionCall(id, parameters)
    }

and expression =
    attempt number <|>
    attempt dataName <|>
    attempt (functionCall ()) <|>
    infixF ()

let test code =
    match run (ws >>? expression .>>? ws .>>? eof) code with
    | Success (result, _, _) -> printfn "%A" result
    | Failure (msg, _, _)    -> printfn "%s" msg

test "87 + 12"

Except, as you might expect, it doesn't work as expected. Indeed, as the code is presented (because when I try infixF alone, and remove it from expression, then it works, but obviously only for one expression: x + y, but not x + y + z), it will cause an overflow error every time. This is the main problem, I think, I encounter in my implementation.

However, the two solutions described do not satisfy one of my questions, which was the sending of function-operator.

In short... I have some questions I would like to receive explanations about, and an implementation problem I would like to fix.

Thank you! :)

Don't have time for a full answer right now, but your concern that OperatorPrecedenceParser only works with predefined operators appears unfounded. There's an AddOperator method that can add new operators dynamically; so when you parse something like F#'s let (>=>) f g = ..., you can dynamically add the >=> operator to your parser. I haven't done it myself yet so I can't advise you on specifics, but a bit of experimenting ought to help. — rmunn
I'd be curious to know how to do it. It seemed to me that the dynamic addition was indeed possible, but not for what I wanted to do, which seems slightly more complex to set up. — Foxy
@rmunn This doesn’t work with mutual recursion though, e.g. let rec (~+) a b = a ~* (b - 1) and (~*) a b = if b<0 then a else (a+b) ~+ b in ... because ~* is defined lexically after it is used. — Dan Robertson
The reason of the stack overflow error is because your grammar (expression and infixExpression) forms a left recursion and you need to refactor that into a right recursion. en.m.wikipedia.org/wiki/Left_recursion — KFL

Dan Robertson Dan Robertson · Accepted Answer · 2019-02-16T19:43:57

So you’re right that the hard part is precedence. I think there are approximately two ways to deal with it for an ML style language

Precedence is defined by fixed rules
Precedence is defined by the user

Ocaml does option 1. The precedence and associativity of an operator is defined by its first character.

Haskell does option 2. The precedence and associativity are defined with statements (and the declaration can come after the operator is used).

It is pretty simple to see how to parse (1): you just parse it normally except instead of only allowing the operator + at that precedence level, you define any operator starting with +. This leaves the question of how you should deal with parsing an expression like a +* b +- c. I don’t know how ocaml would associate this but my guess would be either based on the second character, or at the same precedence level (e.g. like parsing + and - at the same precedence level and associating to the left so a + b - c + d parses as ((a + b) - c) + d).

I think you also have the right idea for parsing (2) but it is tricky. I think your type is slightly wrong and what you actually want is something like:

type operator = Op of string
type expression =
  | Var of string
  | Operator of operator
  | App of expression * expression
  | Tuple of expression list
  | Infix of expression * (operator * expression) list

Specifically you can’t have Infix of expression * operator * expression because then how do you parse a OP b OP c? You have basically two choices:

Infix (Infix (Var a, Op OP, Var b), Op OP, Var c)
Infix (Var a, Op OP, Infix (Var b, Op OP, Var c))

Option 1 is equivalent to (a OP b) OP c and works for -, and |> but not a Haskell style $ and certainly not for a + b * c. Similarly option 2 works for + but not for - or /. Furthermore, it is not sufficient to just undo this mangling before sorting out precedence because the expression (a OP b) OP c must be parsed as option 1, even if it is unmangled.

Note we (if we want an ML style language) need a way to express the function of an operator as a value, e.g. (+) but this could eg be subsumed inside Var.

Once you have this level of parsing you can then wait until you have determined any operator precedence rules for the operators and then you can parse.

Some other things that might be worth considering:

Pre/postfix operators: Ocaml allows prefix operators iff they start with specific symbols, e.g. !. Haskell allows postfix operators as an extension but only using slices (I.e. the extension loosens the definition of (x*) from (\y -> (*) x y) to ((*) x), so (*) may take a single argument. If you want the ability to have pre and postfix operators be user defined, you could maybe change the type to drop application and the rule that you can have exactly one operator between expressions, and then have a step to parse a list of expression | operator into something sane, e.g. does a * + b parse as a (*(+b)), (a) * (+b), (a*) (+b), (a*) + (b), or ((a*)+) b? Maybe this difficulty would be bad for human readers too.
How to deal with precedence? In Haskell you choose an integer from 0 to 9. In perl6 you instead just say e.g * is tighter than +, and if two operators with an undefined relationship appear together, the language requires you put in parens.

It’s maybe worth noting the perl6 way as another option. In this one, operators must have their precedence and associativity/fixity defined before they are used, and the parser dynamically adds these between them being declared and used (one may also do this with the entire grammar of the language so having a parsing future expressions depend on evaluating earlier ones is slightly less crazy).

Parsing custom infix operators + implementation with FParsec

1 Answers