6
votes

Assume I have a mathematical expression in a "tree" form in OCaml. It's represented as an algebraic type like this:

type expr = 
   Number of int
  |Plus of expr*expr

Well, this is a very simplified definition, but it's enough to describe the problem.

I want to convert it to a string in a reverse polish notation, so that Plus (Number i, Number j) becomes (+ i j). A straightforward implementation would be

let rec convert = function
   Number i -> string_of_int i
  |Plus (a,b) -> (let s = convert a in let p = convert b in "(+"^s^" "^p^")")

But the thing is that it's incredibly slow on some input (that have a big tree depth). For example, this input works 5 seconds on my machine:

let rec make_pain so_far = function
  0 -> so_far |i -> make_pain (Plus (Number 1,so_far)) (i-1)

let pain = make_pain (Number 1) 20000

let converted = convert pain

It seems that string concatenation x^y, where y is a long string, is the performance problem. Indeed, if I replace the "(+"^s^" "^p^")" expression with mere s^p, it becomes much faster.

Using printf instead of concatenation doesn't make it any faster. Converting to C might help, but isn't there an OCaml solution?

1
@Chris yeah, the issue's as old as C :)P Shved

1 Answers

10
votes

Use a string Buffer.

^ is defined as,

let (^) s1 s2 =
  let l1 = string_length s1 and l2 = string_length s2 in
  let s = string_create (l1 + l2) in
  string_blit s1 0 s 0 l1;
  string_blit s2 0 s l1 l2;
  s

What you are doing is creating a new string each time and copying the old values in. Pretty much standard in any language where strings are represented as character arrays. The hangup happens because you are doing this FOUR times for each node (there isn't an optimization for multiple ^ calls)! As for a buffer, it will create a giant string and continually fill it in as managed by the data structure,

 type t =
   {mutable buffer : string;
    mutable position : int;
    mutable length : int;
    initial_buffer : string}

Even if you decide to create the initial buffer size to 1, the resize function will adjust the size in a way that will limit the number of re-allocations. For example, the add_string function will increase the size of the array by len*2^(n+p-len), where n is the length of the new string, p is the position and len is the length of the original buffer --only if the buffer cannot support the string, of course. So, the size of the buffer grows exponentially and there will be few re-allocations as things progress. Of course, it's best to set the initial buffer to something reasonable, but it isn't necessary.

The new convert function wouldn't look much more verbose:

let rec convert buf ex =
  let addb = Buffer.add_string buf in
  match ex with
   Number i -> addb (string_of_int i)
  |Plus (a,b) -> (addb "(+ "; convert buf a; addb " "; convert buf b; addb ")")