I thought from looking at the internals before that the ar_funcall overhead would be the main factor. However, it turns out that of the 40 seconds, about 25 are spent in _<, _+ takes about 9 seconds, ar-funcalls about 2.5 seconds, ar-false? and _- about 1 second each.
So it looks like arc< is the main time sink, although the others all contribute.
I think checking the type of all the arguments is costly.
In case anyone is interested, the Scheme code assigned to _fib is:
(if (not (ar-false? (ar-funcall2 _< n 2))) 1
(ar-funcall2 _+ (ar-funcall1 _fib (ar-funcall2 _- n 1))
(ar-funcall1 _fib (ar-funcall2 _- n 2)))))
I don't know much about MzScheme internals. But from my experience of implementing Gauche Scheme, inlining basic operators such as <, +, and - is a big performance boost, and I suspect MzScheme does similar thing as far as these ops are not redefined. Calling them many times via 'wrapper' function like ac.scm does diminishes that effect.
That said, one possible overhead is that, although arc compiler tries to use ar-funcall2 to avoid consing the arguments, they are consed after all since _+ and _< are defined to take a list of arguments.
The original (time (fib 34)) took 79.8s on my machine.
Changing arc< in ac.scm with this:
(cond [(and (number? a) (number? b)) (< a b)]
[(and (string? a) (string? b)) (string<? a b)]
[(and (symbol? a) (symbol? b)) (string<? (symbol->string a)
[(and (char? a) (char? b)) (char<? a b)]
[else (< a b)])]
(cond ((all number? args) (apply < args))
((all string? args) (pairwise string<? args #f))
((all symbol? args) (pairwise (lambda (x y)
(string<? (symbol->string x)
((all char? args) (pairwise char<? args #f))
(#t (apply < args)))]))
made (time (fib 34)) 72.8s, and further changing _+ with this:
(cond [(and (string? a) (string? b)) (string-append a b)]
[(and (arc-list? a) (arc-list? b))
(ac-niltree (append (ar-nil-terminate a)
[else (+ a b)])]
(cond [(all string? args)
(apply string-append args)]
[(all arc-list? args)
(ac-niltree (apply append (map ar-nil-terminate args)))]
[else (apply + args)])]))
made (time (fib 34)) 49.5s.
But generally, these kind of tuning for microbenchmarks only have small effect in real applications, for microbenchmarks magnifies inefficiency of very specific parts.
(Afterthought: It won't be too difficult to write a Scheme macro that expands variable-arity type-dispatching functions like _+ or _+ into case-lambda form. Then this kind of optimization can be applied without cluttering the source too much.)