This paper claims to make a certain Lisp program run faster than its C equivalent. Trying to reproduce the results, I was able to get close (Lisp is 50% slower than C) but wanted to know if anyone knows ways to squeeze more perf out of SBCL 1.3.1.
The target problem is adding a constant single float to every cell in an 800 x 800 array of single floats. The method is to write the program in C and in Common Lisp and compare the times. Using this portable timer, The C code is as follows:
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include "./modules/tictoc/tictoc.h"
const int HORZ = 800;
const int VERT = 800;
#define PERF_REPS 1000
typedef float DATA_T;
struct image_s {
size_t n;
size_t w, h;
DATA_T * data;
};
typedef struct image_s image;
image * make_image (size_t w, size_t h) {
size_t n = w * h;
DATA_T * data = (DATA_T *)malloc(sizeof(DATA_T) * n);
assert (NULL != data);
image * result = (image *)malloc(sizeof(image));
assert (NULL != result);
result->n = n;
result->w = w;
result->h = h;
result->data = data;
return result;
}
void free_image (image * it) {
assert (NULL != it);
assert (NULL != it->data);
free (it->data);
free (it);
}
image * init_to_value (image * it, DATA_T val) {
assert (NULL != it);
assert (NULL != it->data);
size_t i;
const size_t n = it->n;
for (i = 0; i < n; ++i) {
it->data[i] = val;
}
return it;
}
void add (image * to, image * from, DATA_T val) {
assert (NULL != to);
assert (NULL != to->data);
assert (NULL != from);
assert (NULL != from->data);
size_t i;
const size_t n = to->n;
for (i = 0; i < n; ++i) {
to->data[i] = from->data[i] + val;
}
}
int main (int argc, char ** argv) {
image * from = init_to_value (make_image (HORZ, VERT), 0.0f);
image * to = init_to_value (make_image (HORZ, VERT), 0.0f);
TicTocTimer clock = tic();
for (size_t i = 0; i < PERF_REPS; ++i)
add (to, from, 42.0);
printf("Elapsed time %f seconds.\n",toc(&clock));
free_image (to);
free_image (from);
return 0;
}
I compile and run the code as follows:
gcc -O3 image-add.c ./modules/tictoc/libtictoc.a && ./a.out
A typical time on my mac book pro is about 0.178 seconds. Pretty nice.
The equivalent Lisp code, using every option I was able to find in the Lisp hyperspec, in the new book Common Lisp Recipes, and in the SBCL user manual, is as follows. The comments indicate some things I tried that did not make a difference.
;;; None of the commented-out declarations made any difference in speed.
(declaim (optimize speed (safety 0)))
(defun image-add (to from val)
(declare (type (simple-array single-float (*))
to from))
(declare (type single-float val))
(let ((size (array-dimension to 0)))
;(declare (type fixnum size))
(dotimes (i size)
;(declare (type fixnum i))
(setf (aref to i) (+ (aref from i) val)))))
(defparameter HORZ 800)
(defparameter VERT 800)
(defparameter PERF-REPS 1000)
(let ((to (make-array (* HORZ VERT) :element-type 'single-float))
(fm (make-array (* HORZ VERT) :element-type 'single-float)))
;(declare (type fixnum HORZ))
;(declare (type fixnum VERT))
(time (dotimes (_ PERF-REPS)
;(declare (type fixnum PERF-REPS))
;(declare (type fixnum _))
;(declare (inline image-add))
(image-add to fm 42.0))))
I compile and run it as follows:
sbcl --script image-perf.lisp
and typical run times are 0.276. Not bad, but I want it much better. Of course, the whole point of the exercise is that the Lisp code is shorter, but does anyone know a way to make it as fast or faster?
compile-file
. Loading source code compiles forms individually. The file compiler may do more. – Rainer Joswig(declaim (inline image-add))
before the definition of the function (and recompile all call sites), I see a speedup (from 0.560s to 0.364s). Could you test that? – coredump(map-into to (lambda (x) (+ x 42f0)) fm)
, which is much shorter. The obvious way to optimize this micro-benchmark code is to forget about the from array, which is constantly filled with zeros ;-) Note also that you could parallelize easily the code without major modifications. Replacingmap-into
withpmap-into
with 4 workers on my machine gives 0.176s of real-time. – coredump(proclaim '(inline image-add))
(I couldn't get adeclaim
to work), along with @Rainer Joswig's suggestion tocompile-file
and then load the .fasl file. Neither of these techniques produced appreciable speedup. I run about .300 seconds, give or take with or without these treatments, versus C's ~0.175. I will try multithreading next. Is it possible that gcc -O3 automatically detects available parallelism from the presence of constants and generates parallel code. I will figure out how to disassemble it and take a look. – Reb.Cabin