Okay, some interesting info: Racket is apparently quite slow at applying a function at runtime. I ran this code in Arc 3.1, ar, and Nu:
(time (repeat 1000000 (let (a b) (list 1 2) a)))
Here are my results:
Nu time: 3165 msec.
Arc 3.1 time: 1830 msec.
ar time: 1391 msec.
Very surprising! Arc 3.1 expands into a Racket let* and Nu expands into a Racket lambda... but ar's code is implemented in Arc, so it expands into multiple nested Arc lambdas... yet somehow manages to be faster than Arc 3.1 and Nu.
I am not sure how plain lambdas are faster than Racket's built-in let*. In any case, I changed my code so it expands into let*, and now Nu gets 1649 msec, which is comparable to ar. Thus the problem is clearly applying a function at runtime.
Just to remove some more variables, I changed the function so it didn't use rest or optional args, but it ended up being almost as slow, so I know it's not that.
Hm... perhaps it's not the fact that it's applying, per se. Perhaps it's because Racket has to create a function every single time. In other words, if Racket sees this:
((fn (x) ...) ...)
It can try and optimize it away so a function is never actually created. But when Racket sees this:
(apply (fn (x) ...) ...)
It can't do that optimization... thus it's forced to create a function. Creating and destroying a function every time the let is called could indeed cause a performance problem.
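If that's right, the difference should show up even in plain Racket, with no Arc layer involved. A minimal sketch of the two shapes (my guess at what the optimizer does, not verified against Racket's internals):

```scheme
;; Direct application: the compiler can see the lambda at the call
;; site and may inline it, so no closure need be allocated.
(time (for ((i (in-range 1000000)))
        ((λ (a b) a) 1 2)))

;; Through apply: the lambda escapes as a value, so a closure
;; presumably has to exist before apply can be called.
(time (for ((i (in-range 1000000)))
        (apply (λ (a b) a) (list 1 2))))
```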
---
EDIT: to test my hypothesis, I used this code:
(let foo (fn ((o a) (o b) . rest) a)
  (time (repeat 1000000 (let x (list 1 2) (apply foo x)))))
...unfortunately, it ended up getting 2966 msec, so it seems the problem is indeed Racket's apply, or perhaps my compiler's apply. I'll see if I can optimize it a bit.
Racket seems fairly good at "lambda lifting", if that is the correct term. To demonstrate with adding the numbers from 1 to n: version 0 should pretty obviously be compiled into a loop; versions 1 and 2 are less obvious. In 64-bit "racket", versions 1 and 2 seem to take twice as long as the loop in version 0, but that's still much better than allocating lambdas; in 32-bit, the "twice as long" difference is smothered by the cost of allocating/GCing bignums. The last version is actually written in Arc, and is 25x slower in arc3.1; though in 32-bit, the cost of handling bignums makes arc3.1 only 2x slower. The 64-bit results seem to demonstrate that Racket successfully avoided allocating closures at runtime in all cases.
(define (sum0 n)
  (let loop ((n n) (tt 0))
    (if (zero? n)
        tt
        (loop (- n 1) (+ n tt)))))

(define sum1
  (λ (n)
    ((λ (f n tt)
       (if (zero? n)
           tt
           (f f (- n 1) (+ n tt))))
     (λ (f n tt)
       (if (zero? n)
           tt
           (f f (- n 1) (+ n tt))))
     n
     0)))

(define sum2
  (λ (n)
    ((λ (f) (f f n 0))
     (λ (f n tt)
       (if (zero? n)
           tt
           (f f (- n 1) (+ n tt)))))))

(= sum3
   (fn (n)
     ((fn (f n tt)
        (if (is n 0)
            tt
            (f f (- n 1) (+ n tt))))
      (fn (f n tt)
        (if (is n 0)
            tt
            (f f (- n 1) (+ n tt))))
      n
      0)))
;Paste this command in, but copy the above to clipboard before running:
arc> (let xs (readall:pbpaste) (map [eval (list '$ _)] butlast.xs) (eval last.xs)
(each x '(sum0 sum1 sum2 _sum3) (repeat 2 (time:eval `(($ ,x) 10000000)))))
;64-bit racket v5.2.0.3: no mallocing beyond initial compilation
time: 41 cpu: 41 gc: 0 mem: 25720
time: 40 cpu: 41 gc: 0 mem: 6096
time: 80 cpu: 80 gc: 0 mem: 7576
time: 80 cpu: 80 gc: 0 mem: 6136
time: 81 cpu: 80 gc: 0 mem: 7576
time: 80 cpu: 80 gc: 0 mem: 6096
time: 1026 cpu: 1027 gc: 0 mem: 7408
time: 1018 cpu: 1019 gc: 0 mem: 6112
;32-bit racket v5.1.3.10: runtime is dominated by consing bignums
time: 894 cpu: 892 gc: 24 mem: 1478560
time: 872 cpu: 872 gc: 16 mem: 1236500
time: 841 cpu: 841 gc: 15 mem: 1238156
time: 844 cpu: 843 gc: 17 mem: 1236476
time: 839 cpu: 839 gc: 15 mem: 1237300
time: 838 cpu: 837 gc: 15 mem: -15541124
time: 1857 cpu: 1857 gc: 18 mem: 1237784
time: 1864 cpu: 1864 gc: 17 mem: 1236436
I haven't fixed `apply` yet, but I did put in a workaround. Using Racket's `apply`, Nu is actually faster than Arc 3.1 and ar!
(timeit (let a 1 a))
ar time: 8.101 gc: 0.268 mem: 973.008
Nu time: 6.923 gc: 0.0 mem: 88.736
Arc 3.1 time: 4.77 gc: 0.28 mem: -3305.9
(timeit (let a (list 1 2) (car a)))
Arc 3.1 time: 17.3 gc: 0.86 mem: -7,759.58
ar time: 10.303 gc: 0.552 mem: 1258.64
Nu time: 8.158 gc: 0.196 mem: -515.648
(timeit (let (a b) (list 1 2) a))
Arc 3.1 time: 17.47 gc: 1.0 mem: -6997.07
ar time: 13.166 gc: 0.696 mem: -16510.112
Nu time: 12.102 gc: 0.512 mem: -10028.488
So, it seems my idea of applying nested functions to implement destructuring is good in essentially every way: faster, shorter, and easier to implement.
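For reference, here's roughly what the nested-function expansion looks like for the destructuring let above, next to the let* version (illustrative expansions, not Nu's literal compiler output):

```scheme
;; (let (a b) (list 1 2) a) via nested lambdas, roughly:
((λ (xs)
   ((λ (a)
      ((λ (b) a)
       (car (cdr xs))))
    (car xs)))
 (list 1 2))

;; ...and via Racket's let*:
(let* ((xs (list 1 2))
       (a (car xs))
       (b (car (cdr xs))))
  a)
```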
Interestingly, judging by the data above, it would seem Arc 3.1 is very slow at creating lists, probably because `list` is implemented in arc.arc, whereas ar and Nu provide faster implementations.
---
Now let's test optional args:
(timeit ((fn (a) a) 1))
ar time: 7.534 gc: 0.352 mem: 866.232
Nu time: 6.976 gc: 0.0 mem: 88.368
Arc 3.1 time: 4.78 gc: 0.28 mem: -3295.58
(timeit ((fn (a (o b)) a) 1))
ar time: 14.493 gc: 0.464 mem: 1639.648
Nu time: 7.903 gc: 0.248 mem: -1664.792
Arc 3.1 time: 5.84 gc: 0.36 mem: -2097.19
Overhead
ar - 6.959
Arc 3.1 - 1.06
Nu - 0.927
As you can see, in Nu and Arc 3.1, there's very little overhead from optional args, but in ar, optional args are quite costly.
Update: I didn't want to be unfair to Arc 3.1 because of its slow implementation of `list`, so I redid the tests using `quote` instead:
(timeit (let a '(1 2) (car a)))
Nu time: 10.628 gc: 0.196 mem: -1747.08
ar time: 8.529 gc: 0.252 mem: 967.432
Arc 3.1 time: 5.26 gc: 0.34 mem: 4952.98
(timeit (let (a b) '(1 2) a))
Nu time: 14.066 gc: 0.52 mem: 90.504
ar time: 13.305 gc: 0.376 mem: -9236.904
Arc 3.1 time: 6.79 gc: 0.35 mem: -2,093.93
Overhead
ar - 4.776
Nu - 3.438
Arc 3.1 - 1.53
As expected, Arc 3.1 is miles ahead of both ar and Nu. Interestingly, Nu is now listed as slower than ar... it would appear that Nu has a faster implementation of `list`, a slower implementation of `quote`, or both. In any case, this demonstrates that applying nested functions should be approximately the same as complex fns in terms of speed.
One thing I noticed is that Nu has drastically greater overhead than Arc 3.1, but less than ar.
It seems the problem was that quote was slow in Nu. I've fixed that, so here are the new times:
(timeit (let a '(1 2) (car a)))
ar time: 8.613 gc: 0.308 mem: 460.696
Nu time: 7.671 gc: 0.0 mem: 88.976
Arc 3.1 time: 5.33 gc: 0.35 mem: 5050.25
(timeit (let (a b) '(1 2) a))
ar time: 12.111 gc: 0.436 mem: -19278.128
Nu time: 11.438 gc: 0.324 mem: 1435.016 (apply fn)
Nu time: 8.96 gc: 0.0 mem: 125.352 (Racket let*)
Arc 3.1 time: 7.0 gc: 0.35 mem: -2124.82
Overhead
ar - 3.498
Arc 3.1 - 1.67
Nu - 1.289
Nu now has the lowest overhead out of the three...! Also note that Nu does not spend any time in garbage collection, unlike ar and Arc 3.1.
This seems to be a common trend: Nu either spending no time in garbage collection, or less time than ar and Arc 3.1. Not sure how important that is compared to raw speed, but it's nice.
Unfortunately, this also demonstrates that applying nested functions is slower than using a Racket let*. So the reason Nu won the speed contest earlier wasn't because of my destructuring idea: Nu was just plain faster than ar in general.
And since I'm in a timing mood, here are the times for optional args:
(time (repeat 1000000 ((fn (a (o b 3)) (list a b)) 1 2)))
Arc 3.1 time: 1828 msec.
ar time: 1814 msec.
Nu time: 1554 msec.
So, it does make a difference that Nu uses plain lambdas, rather than complex fns! Now I just need to get apply to be faster.
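Racket's own lambda supports optional arguments with defaults, so one plausible way to keep optional args cheap is to compile them straight onto a plain Racket lambda (illustrative sketch; not necessarily how Nu actually does it):

```scheme
;; (fn (a (o b 3)) (list a b)) compiled onto a Racket optional arg:
(define f (λ (a (b 3)) (list a b)))

(f 1 2)  ; => '(1 2)
(f 1)    ; => '(1 3)
```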
---
On a related note: racket-set! is slow. Using racket-let or Arc's let is faster, by a fairly significant amount. So I'll be changing my compiler so it doesn't use mutation.
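My understanding is that Racket boxes any variable that is set!, so every read and write goes through an extra indirection; rebinding via let avoids that. A sketch of the two styles (hypothetical compiler output, same result either way):

```scheme
;; Mutation style: i and tt are set!, so Racket likely boxes them.
(define (sum-set! n)
  (let ((i n) (tt 0))
    (let loop ()
      (if (zero? i)
          tt
          (begin
            (set! tt (+ tt i))
            (set! i (- i 1))
            (loop))))))

;; Rebinding style: fresh bindings on each iteration, no mutation.
(define (sum-let n)
  (let loop ((i n) (tt 0))
    (if (zero? i)
        tt
        (loop (- i 1) (+ tt i)))))
```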