Hm... perhaps it's not the fact that it's applying, per se. Perhaps it's because Racket has to create a function every single time. In other words, if Racket sees this:
((fn (x) ...) ...)
It can try and optimize it away so a function is never actually created. But when Racket sees this:
(apply (fn (x) ...) ...)
It can't do that optimization... thus it's forced to create a function. Creating and destroying a function every time the let is called could indeed cause a performance problem.
---
EDIT: to test my hypothesis, I used this code:
(let foo (fn ((o a) (o b) . rest) a)
(time (repeat 1000000 (let x (list 1 2) (apply foo x)))))
...unfortunately, it ended up getting 2966 msec, so it seems the problem is indeed Racket's apply, or perhaps my compiler's apply. I'll see if I can optimize it a bit.
Racket seems fairly good at "lambda lifting", if that is the correct term. To demonstrate with adding the numbers from 1 to n: version 0 should pretty obviously be compiled into a loop; versions 1 and 2 are less obvious. In 64-bit "racket", versions 1 and 2 seem to take twice as long as the loop in version 0, but that's still much better than allocating lambdas; in 32-bit, the "twice as long" difference is smothered by the cost of allocating/GCing bignums. The last version is actually written in Arc, and 25x slower in arc3.1; though in 32-bit, the cost of handling bignums makes arc3.1 only 2x slower. The results of the 64-bit version seem to demonstrate that Racket successfully avoided allocating closures at runtime in all cases.
(define (sum0 n)
(let loop ((n n) (tt 0))
(if (zero? n)
tt
(loop (- n 1) (+ n tt)))))
(define sum1
(λ (n)
((λ (f n tt)
(if (zero? n)
tt
(f f (- n 1) (+ n tt))))
(λ (f n tt)
(if (zero? n)
tt
(f f (- n 1) (+ n tt))))
n
0)))
(define sum2
(λ (n)
((λ (f) (f f n 0))
(λ (f n tt)
(if (zero? n)
tt
(f f (- n 1) (+ n tt)))))))
(= sum3
(fn (n)
((fn (f n tt)
(if (is n 0)
tt
(f f (- n 1) (+ n tt))))
(fn (f n tt)
(if (is n 0)
tt
(f f (- n 1) (+ n tt))))
n
0)))
;Paste this command in, but copy the above to clipboard before running:
arc> (let xs (readall:pbpaste) (map [eval (list '$ _)] butlast.xs) (eval last.xs)
(each x '(sum0 sum1 sum2 _sum3) (repeat 2 (time:eval `(($ ,x) 10000000)))))
;64-bit racket v5.2.0.3: no mallocing beyond initial compilation
time: 41 cpu: 41 gc: 0 mem: 25720
time: 40 cpu: 41 gc: 0 mem: 6096
time: 80 cpu: 80 gc: 0 mem: 7576
time: 80 cpu: 80 gc: 0 mem: 6136
time: 81 cpu: 80 gc: 0 mem: 7576
time: 80 cpu: 80 gc: 0 mem: 6096
time: 1026 cpu: 1027 gc: 0 mem: 7408
time: 1018 cpu: 1019 gc: 0 mem: 6112
;32-bit racket v5.1.3.10: runtime is dominated by consing bignums
time: 894 cpu: 892 gc: 24 mem: 1478560
time: 872 cpu: 872 gc: 16 mem: 1236500
time: 841 cpu: 841 gc: 15 mem: 1238156
time: 844 cpu: 843 gc: 17 mem: 1236476
time: 839 cpu: 839 gc: 15 mem: 1237300
time: 838 cpu: 837 gc: 15 mem: -15541124
time: 1857 cpu: 1857 gc: 18 mem: 1237784
time: 1864 cpu: 1864 gc: 17 mem: 1236436
I haven't fixed `apply` yet, but I did put in a workaround. Using Racket's `apply`, Nu is actually faster than Arc 3.1 and ar!
(timeit (let a 1 a))
ar time: 8.101 gc: 0.268 mem: 973.008
Nu time: 6.923 gc: 0.0 mem: 88.736
Arc 3.1 time: 4.77 gc: 0.28 mem: -3305.9
(timeit (let a (list 1 2) (car a)))
Arc 3.1 time: 17.3 gc: 0.86 mem: -7,759.58
ar time: 10.303 gc: 0.552 mem: 1258.64
Nu time: 8.158 gc: 0.196 mem: -515.648
(timeit (let (a b) (list 1 2) a))
Arc 3.1 time: 17.47 gc: 1.0 mem: -6997.07
ar time: 13.166 gc: 0.696 mem: -16510.112
Nu time: 12.102 gc: 0.512 mem: -10028.488
So, it seems my idea of applying nested functions to implement destructuring is good in essentially every way: faster, shorter, and easier to implement.
Interestingly, judging by the data above, it would seem Arc 3.1 is very slow at creating lists, probably because `list` is implemented in arc.arc, whereas ar and Nu provide faster implementations.
---
Now let's test optional args:
(timeit ((fn (a) a) 1))
ar time: 7.534 gc: 0.352 mem: 866.232
Nu time: 6.976 gc: 0.0 mem: 88.368
Arc 3.1 time: 4.78 gc: 0.28 mem: -3295.58
(timeit ((fn (a (o b)) a) 1))
ar time: 14.493 gc: 0.464 mem: 1639.648
Nu time: 7.903 gc: 0.248 mem: -1664.792
Arc 3.1 time: 5.84 gc: 0.36 mem: -2097.19
Overhead
ar - 6.959
Arc 3.1 - 1.06
Nu - 0.927
As you can see, in Nu and Arc 3.1, there's very little overhead from optional args, but in ar, optional args are quite costly.
Update: I didn't want to be unfair to Arc 3.1 because of its slow implementation of `list`, so I redid the tests using `quote` instead:
(timeit (let a '(1 2) (car a)))
Nu time: 10.628 gc: 0.196 mem: -1747.08
ar time: 8.529 gc: 0.252 mem: 967.432
Arc 3.1 time: 5.26 gc: 0.34 mem: 4952.98
(timeit (let (a b) '(1 2) a))
Nu time: 14.066 gc: 0.52 mem: 90.504
ar time: 13.305 gc: 0.376 mem: -9236.904
Arc 3.1 time: 6.79 gc: 0.35 mem: -2,093.93
Overhead
ar - 4.776
Nu - 3.438
Arc 3.1 - 1.53
As expected, Arc 3.1 is miles ahead of both ar and Nu. Interestingly, Nu is now listed as slower than ar... it would appear that either Nu has a faster implementation of `list`, a slower implementation of `quote`, or possibly both. In any case, this demonstrates that applying nested functions should be approximately the same as complex fns in terms of speed.
One thing I noticed is that Nu has drastically greater overhead than Arc 3.1, but less than ar.
It seems the problem was that quote was slow in Nu. I've fixed that, so here's the new times:
(timeit (let a '(1 2) (car a)))
ar time: 8.613 gc: 0.308 mem: 460.696
Nu time: 7.671 gc: 0.0 mem: 88.976
Arc 3.1 time: 5.33 gc: 0.35 mem: 5050.25
(timeit (let (a b) '(1 2) a))
ar time: 12.111 gc: 0.436 mem: -19278.128
Nu time: 11.438 gc: 0.324 mem: 1435.016 (apply fn)
Nu time: 8.96 gc: 0.0 mem: 125.352 (Racket let*)
Arc 3.1 time: 7.0 gc: 0.35 mem: -2124.82
Overhead
ar - 3.498
Arc 3.1 - 1.67
Nu - 1.289
Nu now has the lowest overhead out of the three...! Also note that Nu does not spend any time in garbage collection, unlike ar and Arc 3.1.
This seems to be a common trend: Nu either spending no time in garbage collection, or less time than ar and Arc 3.1. Not sure how important that is compared to raw speed, but it's nice.
Unfortunately, this also demonstrates that applying nested functions is slower than using a Racket let*. So the reason Nu won the speed contest earlier wasn't because of my destructuring idea: Nu was just plain faster than ar in general.
And since I'm in a timing mood, here's the times for optional args:
(time (repeat 1000000 ((fn (a (o b 3)) (list a b)) 1 2)))
Arc 3.1 time: 1828 msec.
ar time: 1814 msec.
Nu time: 1554 msec.
So, it does make a difference that Nu uses plain lambdas, rather than complex fns! Now I just need to get apply to be faster.
---
On a related note: racket-set! is slow. Using racket-let or Arc's let is faster, by a fairly significant amount. So I'll be changing my compiler so it doesn't use mutation.