Arc Forum | Binary search in Arc

Arc Forum

Binary search in Arc

5 points by dpkendal 5203 days ago | 6 comments

    (def bsearch (stack needle (o f <)  (o offset 0))
      (if (is (len stack) 1)
        (if (is needle (car stack)) offset nil)
        (let (fst lst) (split stack (round (/ (len stack) 2)))
          (if (f (last fst) needle)
            (bsearch lst needle f (+ offset (len fst)))
            (bsearch fst needle f offset)))))

Searches for needle in stack. You can use a custom sort-comparison function as the third argument. offset is used internally to keep where we are in the whole list. Returns nil if there's no such item in the list.

(Of course, because Arc uses linked lists, this is actually considerably slower than linear iteration through the list, because every list access is multiplied by O(n). An interesting exercise nonetheless.)

2 points by rocketnia 5203 days ago | link

You're missing the case where the haystack is emtpy:

  arc> (bsearch '(0 1 4 9 16) 9)
  3
  arc> (bsearch '(0 1 4 9 16) 8)
  nil
  arc> (bsearch '() 8)
  Error: "<: expects type <real number> as 1st argument, given: 'nil; other arguments were: 8"

Even though fallintothis just did something like this, here's my take on improving 'bsearch a bit.

  (def <-to-== ((o < <))
    (fn (a b)
      (~or (< a b) (< b a))))
  
  (def bsearch (stack needle (o < <) (o == <-to-==.<)
                (o offset 0) (o len-stack len.stack))
    (case len-stack
      0  nil
      1  (when (== car.stack needle) offset)
         (withs (len-before (int:/ len-stack 2)
                 after (nthcdr len-before stack))
           (if (< needle car.after)
             (bsearch stack needle < == offset len-before)
             (bsearch after needle < ==
               (+ offset len-before) (- len-stack len-before))))))

  arc> (time:repeat 10000 (your-bsearch '(0 1 4 9 16) 9))
  time: 1629 msec.
  nil
  arc> (time:repeat 10000 (your-bsearch '(0 1 4 9 16) 8))
  time: 1610 msec.
  nil
  arc> (time:repeat 10000 (my-bsearch '(0 1 4 9 16) 9))
  time: 404 msec.
  nil
  arc> (time:repeat 10000 (my-bsearch '(0 1 4 9 16) 8))
  time: 397 msec.
  nil

(Note that I'm using Anarki here.)

I've made the equality comparison its own parameter, and it's based on the comparison function by default rather than being 'is. That way you can find 3.0 by searching for 3 (even though 3 and 3.0 aren't 'is), and you can pass in other comparators like 'iso instead.

I've put in the 'len-stack local and used the 'car of the later portion rather than the 'last of the earlier portion. That lets me avoid doing linear traversals any more than I have to. I don't know if this helps with the time complexity, 'cause the complexity's still linear in the length of the haystack: Even if the caller calculates 'len-stack in constant time (say, using 'qlen) and passes it explicitly, the uses of 'nthcdr ultimately do about as many cdrs as there are elements in the list.

The 'car-versus-'last change does affect the outcome of the algorithm a little: Mine finds the latest applicable index in the list, whereas yours finds the earliest.

I also used 'int instead of 'round. This is apparently a little bit faster (and should have no effect on the results or algorithmic complexity):

  arc> (time:repeat 100000 (int 10))
  time: 121 msec.
  nil
  arc> (time:repeat 100000 (int 21/2))
  time: 232 msec.
  nil
  arc> (time:repeat 100000 (round 10))
  time: 236 msec.
  nil
  arc> (time:repeat 100000 (round 21/2))
  time: 354 msec.
  nil

Furthermore, I've specifically chosen 'nthcdr over 'split so as to reduce consing (not just to reduce the uses of 'len). Now the only places I suspect memory will be allocated are the lambdas produced by 'withs (which oughta be something the Racket compiler can strip out), maybe possibly the arithmetic (if you're dealing with ridiculously gargantuan list lengths), and miscellaneous runtime overhead associated with things like function calls and variable lookup.

In fact, we probably do have some argument list consing to worry about: Since 'bsearch has optional arguments, it's compiled using 'ac-complex-fn, which makes it a varargs function as far as Racket's concerned.

Here's a version that implements the loop using a non-varargs function:

  (load "lib/util.arc")  ; for 'xloop
  
  ; same as before
  (def <-to-== ((o < <))
    (fn (a b)
      (~or (< a b) (< b a))))
  
  (def bsearch (stack needle (o < <) (o == <-to-==.<))
    (xloop (stack stack offset 0 len-stack len.stack)
      (case len-stack
        0  nil
        1  (when (== car.stack needle) offset)
           (withs (len-before (int:/ len-stack 2)
                   after (nthcdr len-before stack))
             (if (< needle car.after)
               (next stack offset len-before)
               (next after
                 (+ offset len-before) (- len-stack len-before)))))))

  arc> (time:repeat 10000 (bsearch '(0 1 4 9 16) 9))
  time: 341 msec.
  nil
  arc> (time:repeat 10000 (bsearch '(0 1 4 9 16) 8))
  time: 330 msec.
  nil

Looks like that does make things a little better.

And if I change + to $.+:

  arc> (time:repeat 10000 (bsearch '(0 1 4 9 16) 9))
  time: 327 msec.
  nil
  arc> (time:repeat 10000 (bsearch '(0 1 4 9 16) 8))
  time: 326 msec.
  nil

That's pretty negligible, but it doesn't seem to be quite within the range of error: I've tried this test a few times to account for things like the JIT warming up, and the $.+ version is consistently slightly better.

So yeah, this algorithm is linear in the length of the list, but it still lets you cut down on the number of times the comparison function is called, which is oftentimes the most important thing.

For binary search with insertion or removal, which is a likely operation if we're maintaining a sorted list at runtime, this algorithm is actually especially ideal. We end up with linear-in-length-times-logarithmic-in-comparisons time (plus an allocation for insertion), which is kinda the same as the time complexity on a dynamic array.

To break through these barriers any further, we'd probably need to adapt the surrounding algorithm so it used a fixed-size data structure (cutting down on allocations) and/or a heap (cutting down on the linear-in-length time factor) and/or a more informative model for the the haystack's distribution (so we could use things like Newton's method for better-than-logarithmic search). But this isn't something the implementor of 'bsearch has control over.

-----

1 point by dpkendal 5202 days ago | link

> You're missing the case where the haystack is emtpy [sic]

D'oh! That was an obvious case to miss. Thanks for your optimisations, I'll look into combining these with fallintothis's.

-----

2 points by fallintothis 5203 days ago | link

Of course, because Arc uses linked lists, this is actually considerably slower than linear iteration through the list, because every list access is multiplied by O(n).

True that. ;)

An interesting exercise nonetheless.

To this end, here are a few Arc tricks.

- Instead of

  (is (len stack) 1)

you can use

  (single stack)

which is defined as

  (def single (x) (and (acons x) (no (cdr x))))

I.e., a linked list has length 1 if it doesn't have a cdr. This makes (single stack) read nicely, but we already know stack is a linked list, so we could also just say

  (~cdr stack) ; == (no:cdr stack) == (no (cdr stack))

- Instead of

  (is needle (car stack))

you could say

  (caris stack needle)

which reads a little nicer, though the definition once more has an acons check:

  (def caris (x val) 
    (and (acons x) (is (car x) val)))

- Notice that the pattern

  (if a b nil)

is the same as

  (and a b)

which potentially reads more clearly:

  (def bsearch (stack needle (o f <)  (o offset 0))
    (if (single stack)
        (and (caris stack needle)
             offset)
        (let (fst lst) (split stack (round (/ (len stack) 2)))
          (if (f (last fst) needle)
              (bsearch lst needle f (+ offset (len fst)))
              (bsearch fst needle f offset)))))

- Since "offset is used internally to keep where we are in the whole list", we could make the parameter internal to an anonymous function, then call that function on the "external" arguments. Using afn (anaphoric fn; http://en.wikipedia.org/wiki/Anaphora_%28linguistics%29), we can recurse in the anonymous function using the name self. This winds up being simple, but kind of ugly.

  (def bsearch (stack needle (o f <))
    ((afn (stack offset)
       (if (single stack)
           (and (caris stack needle)
                offset)
           (let (fst lst) (split stack (round (/ (len stack) 2)))
             (if (f (last fst) needle)
                 (self lst (+ offset (len fst)))
                 (self fst offset)))))
     stack 0))

We wind up needing to supply the initial arguments to afn all the way at the bottom, wrapping the whole thing in another set of parentheses for the call. This prompted utlities like http://arclanguage.org/item?id=10055.

-----

1 point by zck 5202 days ago | link

>- Notice that the pattern

  (if a b nil)

>is the same as

  (and a b)

For this, I prefer

  (when a b)

>This makes `(single stack)` read nicely, but we already know stack is a linked list, so we could also just say

  (~cdr stack) ; == (no:cdr stack) == (no (cdr stack))

I wouldn't want to go to this optimization until finding out that it would significantly increase performance, but that argument hinges on thinking it's slightly less readable than `(single stack)`.

-----

3 points by shader 5202 days ago | link

Note that you don't need the nil at the end of the if, so

  (if a b)

also works.

-----

1 point by dpkendal 5202 days ago | link

Either of these are neater than using `and`, which for clarity is better only used as a predicate, in my opinion.

Thanks fallintothis, zck, and shader for your improvements. I'm trying to take everyone's suggestions into account to make a new version, which I'll put in Anarki when I'm satisfied with it.

-----