| I'll name my results first: 1. I've determined, with high confidence, the cause of the queue bug: garbage collection during enq'ing or deq'ing. (Atomic crap and threads have nothing to do with it. set-ca/dr! in ac.scm is implemented with a "thread-cell" pointing to a (malloc _scheme 1), but if you throw out the thread cell and just use a single (malloc _scheme 1), you still have the same problems. Meanwhile, I've found that a change in (current-gc-milliseconds) usually happens during the error step; and in the Arc version below, where the gc-milliseconds monitoring encloses the entire call to enq and deq, "usually" means "every time I've tried it".) 2. I've translated[1] the queue bug demonstration program into plain Racket (well, it has #lang mzscheme at the top so it's not really Racket but whatever), stripping out everything not necessary to cause the error. Occasionally this version segfaults. 3. So. Choices seem to be: 0) hope it's not too much of a problem right now; 0a) maybe reimplement scar and scdr so they check for GC, and if that happens, just redo it (problems: it might be prohibitively expensive; also, while it may nonetheless be doable, I did try a form of that in the Racket code and failed; the scope you need to watch for GC might be bigger than you expect); 1) someone can improve the pointer hack; 2) make Arc be implemented with mutable pairs or some such thing; 3) get the Racket people to do something (create a "no-gc-here" special form, make it possible to make all pairs mutable by default, something). Incidentally, I'd like to know whether Paul Graham has problems with this when serving Hacker News or the Arc Forum. I'm pretty sure the source code includes queues, or it did as of arc3.1's release. Any otherwise inexplicable errors happening in queues? any otherwise inexplicable segfaults? If so, how does he deal with it? just shrug, run (nsv) again, and cross his fingers? -- The report. First, here is akkartik's original program[2], except modified slightly to illustrate point #1. (I didn't figure this out until running through point #2 and trying several ways of modifying "ptr", "get-ptr", and "set-ca/dr!" from ac.scm.) It so happens that Arc imports the function "current-gc-milliseconds" from Racket, so we don't even need $ for it. (= q (queue))
(mac did-gc (name . body)
(w/uniq g
`(let ,g (current-gc-milliseconds)
,@body
(unless (is ,g (current-gc-milliseconds))
(prn "GC at " ',name)))))
(def verify(q)
(prn q)
(if q.0
; The contract for queues: q.1 is reachable from q.0
(unless (reclist [is _ q.1] q.0)
(prn "error"))))
(repeat 1000000
(prn "iter")
(repeat rand.10
(verify q)
(did-gc enq
(enq 0 q)))
(prn "deq")
(until (is 0 qlen.q)
(verify q)
(did-gc deq
(deq q))))
When you run this, and it (finally) errs, and you look at the last few lines, you will probably see something like this: iter
(nil (0) 0)
((0) (0) 1)
GC at enq
((0) nil 2)
error
Error: "set-cdr!: expected argument of type <pair>; given 'nil"
Now, as mentioned, I have the Racket version below[1]. Protip: Comment out the prn in 'verify when you don't need it; it makes the output much less massive. Also, use "racket" rather than DrRacket, because the latter takes probably 100x as long to print q as it does to perform the queue operations. And then make sure your terminal doesn't have infinite history. Use (enter! "qbug.rkt") [or whatever you call the file] and then (run) to run the converted Racket code. Also, suggestion for further investigation: To test this stuff closely, force (collect-garbage). Probably do it after 100 iterations or so, so there actually is garbage.And, as mentioned, sometimes (when I run it a bunch) a plain segmentation fault happens. It's quite rare, but it does happen. It looks like this (I've commented out all prn's except those relating to the GC's): > (run)
GC around enq
GC around deq
GC around enq
GC around enq
GC around deq
GC around deq
Segmentation fault
I'll mention that this is another way to cause segfault with these C pointers. (ptr-ref (get-ptr) _pointer 0) is a C pointer to a cons cell: > (ptr-ref (get-ptr) _pointer 0)
#<cpointer>
> (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 1)
7
> (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 2)
'()
> (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 0)
Segmentation fault
So the segfault may still be something to do with the garbage collection clobbering the C pointers, rather than being some strange inexplicable thing wrong with Racket.My Arc code reports many more GC's than the Racket code (probably because 'enq and 'deq are much more consing-heavy in Arc, not least because they are atomically invoked and stuff, and because my Arc GC-watcher watches the entire call to enq rather than just the calls to x-set-c[ad]r!), and I haven't gotten it to segfault in Arc yet. Arc might conceivably be immune to the segfaults, and we might be able to protect it from the errors. By the way, I am using racket v5.0.99.7 x86_64 compiled from source on Mac OS X 10.6.6 this morning. (Ooh yeah. Conses take up 2x space, but Arc is maybe 5-10% faster.) I doubt very much that it makes a difference, though. [1] https://gist.github.com/796949 [2] http://arclanguage.org/item?id=11347 |