Arc Forumnew | comments | leaders | submitlogin
Hash tables are unnecessary
10 points by ksvanhorn 5868 days ago | 9 comments
If I understand correctly, a design goal for arc is to have everything built up from a small, simple set of fundamental "axioms". Thus it is rather jarring to see a complex data structure (hash tables) bolted onto the language at a fundamental level. Let me suggest two alternatives that would allow one to create this functionality from simpler alternatives.

1. Introduce arrays as a data structure in Arc. I'm not sure why they've been left out. Once you have arrays, you can build hash tables from them. Note that, if we consider a dotted pair to be an array of length two, then arrays are the only data structure we need, as you can build up lists from dotted pairs.

2. Replace hash tables with purely functional data structures such as the Map and Set types of F# (I assume O'Caml has these too). The underlying data structures for these are red-black trees, which you can build up from lists. You can take one of these maps m and create a new map m' that has one more or one fewer key/value pairs, without modifying he original map, and this only takes O(log n) time. Likewise, looking up a value from a key takes only O(log n) time.

At this point, let me comment on the following lines in the tutorial:

"I once thought alists were just a hack, but there are many things you can do with them that you can't do with hash tables, including sort them, build them up incrementally in recursive functions, have several that share the same tail, and preserve old values."

You can do all of these things with purely functional maps. Futhermore, lookup time is exponentially faster. Even if computation speed isn't a high priority for Arc, you can't ignore the difference between O(log n) access time and O(n) access time.

Similarly, how would you like a purely functional array-like data structure where you can do all of the following in O(log n) time, without modifying the original array:

a) create a new array that is the same as the old except at one index;

b) extract a subsequence of the original array;

c) create a new array that is the catenation of two existing arrays.

If I understand correctly, such a data structure could be created by generalizing the rope data structure (http://www.sgi.com/tech/stl/Rope.html), invented by Alexander Stepanov and Matthew Austern, to hold arbitrary values (instead of just characters).



7 points by absz 5868 days ago | link

You raise some interesting points, but first, I have to address a confusion I've seem come up before. It is this: arrays are not lists. Arrays are a data structure with O(1) indexing, designed to be accessed and not iterated over. Cons cells/linked lists are a data structure with O(n) indexing, designed to be iterated over and not accessed. What's more, if we have a dotted pair be an array of length two, we get

  Cons             vs.          Array
  -----------------------------------
  (a . b)         <---> [a, b]
  nil / ()        <---> nil / []
  (a . nil) / (a) <---> [a, []]
  (a . (b . nil)) <---> [a, [b, nil]]
Arrays and lists have fundamentally different structures (one emphasizes first vs. rest, the other emphasizes here vs. there), and this is why mzscheme / PLT Scheme separates vectors (its arrays) and cons cells.

As for content: it seems to me that your title overstates your point. You appear to be arguing for the removal of hash tables as an axiom, which is different from arguing that they are "unnecessary". (To be fair, your second argument does propose to eliminate them, but you replace them with something with equivalent access "terminology" (though not speed nor mutability).) Hash tables (or really any sort of dictionary) and closures/anonymous functions are, in my opinion, the two most useful things in programming, so I feel that they should be part of the language; whether in ac.scm or in arc.arc doesn't matter to me as an Arc user. However, as someone who cares about the implementation, your point about simplifying the axioms and examining speed tradeoffs is a good one, and is worth considering (though I'll leave it for wiser heads than mine). It would be nice if we could get all our complex data structures out of one (or a few) axiom(s), and perhaps arrays would suffice; after all, they do for C.

-----

6 points by almkglor 5868 days ago | link

1) http://arclanguage.com/item?id=4146 . Might actually implement this in arc2c if I stop being lazy some time soon. And definitely after implementing arc2c macros.

What arc needs is a concept of "sequence", not separate array and list types. Yes, I know, I know, "lists are cons cells". Hahahaha. Well, 'map, 'some, 'all, 'keep, 'rem don't work on improper lists anyway. Improper lists are rarely used, and the implementation I propose can support them anyway, just not as efficiently. So some loss of efficiency in 'scdr is acceptable, at least you get O(n/m) index time, with O(1 + C) if m > n.

2) Sounds interesting. You could implement this using Arc lists as base, and use my lib/settable-fn.arc extension (or lib/settable-fn2.arc, which is fractionally more likely to get added into ArcN and which I intend to use for arc2c in lieu of my settable-fn) to support some of the mutation stuff. See also my "Create your own collection" series.

-----

8 points by KirinDave 5866 days ago | link

An associative array primitive is one of the major benefits of Perl and Ruby for writing readable code that still runs in a reasonable amount of time (not fast).

Removing it for conceptual purity weakens the language. The hash table was complex years ago, but by now we should come to regard it as a primitive.

-----

3 points by almkglor 5868 days ago | link

Re: ropes - here's a discussion of them http://www.google.com/url?sa=t&ct=res&cd=1&url=h...

-----

3 points by stefano 5867 days ago | link

1. Hash tables gives you O(1) access with lesser constant times than red-black-trees, that coud need a few rotations after every insert/remove.

2. It's better to use cons cells for list construction rather than vectors, because vectors have got an extra cell to store the length of the vector itself. On 32 bit machines a cons cell takes 8 bytes, a vector of 2 elements takes 12 byte i.e. 50% more than the cons cell.

-----

4 points by almkglor 5867 days ago | link

2) In theory, the length of the vector could be stored by the memory manager anyway (possibly as a start and end address), which after all has to know how big to deallocate when the vector is GC'ed. In fact if the metadata is interspersed with the allocated memory it probably will be just a length, since the current address is known.

-----

1 point by sacado 5866 days ago | link

Just wondering, but couldn't hash tables be implemented as "clever" association lists ? I mean, the interpreter / compiler could switch by itself to a hash-table representation when the user call 'assoc or 'alref on something like

  ((a b) (c d) (e f) ...)
In other words, a list of lists-of-size-2. The new representation would of course have to share the same semantics as a-lists (possible to cons, to get the car, the cddadr, ...)but with performance as a bonus.

Of course, it would still have to be able to go back to a "regular" list : I could do

  (push 3 (car l))
  -> ((3 a b) (c d) ...)
There, the compiler would have to see I don't want an association list anymore and switch back to a more traditional representation. In a way, this is the same idea as almkglor's one about lists secretly implemented as arrays.

This way, we would have one less primitive with all the benefits of hash tables. I don't know if it is possible, but as for know I don't see why it wouldn't.

-----

2 points by almkglor 5865 days ago | link

  (alref foo 'x)
     vs.
  foo!x

-----

1 point by sacado 5865 days ago | link

Hmm, right, but that's just a design issue. If pg wanted to simplify the axioms, he could remove hash tables, make alists behave the way I explained and make the syntax foo!x look for the 'x key in the alist 'foo.

Edit : There would be a problem however if we try foo!3 : do we mean the third element, or the element associated to the key 3 ?

-----