Arc Forum | Queue bug investigation

Arc Forum

Queue bug investigation

6 points by waterhouse 5273 days ago | 13 comments

I'll name my results first:

1. I've determined, with high confidence, the cause of the queue bug: garbage collection during enq'ing or deq'ing. (Atomic crap and threads have nothing to do with it. set-ca/dr! in ac.scm is implemented with a "thread-cell" pointing to a (malloc _scheme 1), but if you throw out the thread cell and just use a single (malloc _scheme 1), you still have the same problems. Meanwhile, I've found that a change in (current-gc-milliseconds) usually happens during the error step; and in the Arc version below, where the gc-milliseconds monitoring encloses the entire call to enq and deq, "usually" means "every time I've tried it".)

2. I've translated[1] the queue bug demonstration program into plain Racket (well, it has #lang mzscheme at the top so it's not really Racket but whatever), stripping out everything not necessary to cause the error. Occasionally this version segfaults.

3. So. Choices seem to be: 0) hope it's not too much of a problem right now; 0a) maybe reimplement scar and scdr so they check for GC, and if that happens, just redo it (problems: it might be prohibitively expensive; also, while it may nonetheless be doable, I did try a form of that in the Racket code and failed; the scope you need to watch for GC might be bigger than you expect); 1) someone can improve the pointer hack; 2) make Arc be implemented with mutable pairs or some such thing; 3) get the Racket people to do something (create a "no-gc-here" special form, make it possible to make all pairs mutable by default, something).

Incidentally, I'd like to know whether Paul Graham has problems with this when serving Hacker News or the Arc Forum. I'm pretty sure the source code includes queues, or it did as of arc3.1's release. Any otherwise inexplicable errors happening in queues? any otherwise inexplicable segfaults? If so, how does he deal with it? just shrug, run (nsv) again, and cross his fingers?

The report. First, here is akkartik's original program[2], except modified slightly to illustrate point #1. (I didn't figure this out until running through point #2 and trying several ways of modifying "ptr", "get-ptr", and "set-ca/dr!" from ac.scm.) It so happens that Arc imports the function "current-gc-milliseconds" from Racket, so we don't even need $ for it.

  (= q (queue))
  
  (mac did-gc (name . body)
    (w/uniq g
      `(let ,g (current-gc-milliseconds)
         ,@body
         (unless (is ,g (current-gc-milliseconds))
           (prn "GC at " ',name)))))

  (def verify(q)
    (prn q)
    (if q.0
      ; The contract for queues: q.1 is reachable from q.0 
      (unless (reclist [is _ q.1] q.0)
        (prn "error"))))

  (repeat 1000000
    (prn "iter")
    (repeat rand.10
      (verify q)
      (did-gc enq
              (enq 0 q)))
    (prn "deq")
    (until (is 0 qlen.q)
      (verify q)
      (did-gc deq
              (deq q))))

When you run this, and it (finally) errs, and you look at the last few lines, you will probably see something like this:

  iter
  (nil (0) 0)
  ((0) (0) 1)
  GC at enq
  ((0) nil 2)
  error
  Error: "set-cdr!: expected argument of type <pair>; given 'nil"

Now, as mentioned, I have the Racket version below[1]. Protip: Comment out the prn in 'verify when you don't need it; it makes the output much less massive. Also, use "racket" rather than DrRacket, because the latter takes probably 100x as long to print q as it does to perform the queue operations. And then make sure your terminal doesn't have infinite history. Use (enter! "qbug.rkt") [or whatever you call the file] and then (run) to run the converted Racket code. Also, suggestion for further investigation: To test this stuff closely, force (collect-garbage). Probably do it after 100 iterations or so, so there actually is garbage.

And, as mentioned, sometimes (when I run it a bunch) a plain segmentation fault happens. It's quite rare, but it does happen. It looks like this (I've commented out all prn's except those relating to the GC's):

  > (run)
  GC around enq
  GC around deq
  GC around enq
  GC around enq
  GC around deq
  GC around deq
  Segmentation fault

I'll mention that this is another way to cause segfault with these C pointers. (ptr-ref (get-ptr) _pointer 0) is a C pointer to a cons cell:

  > (ptr-ref (get-ptr) _pointer 0)
  #<cpointer>
  > (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 1)
  7
  > (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 2)
  '()
  > (ptr-ref (ptr-ref (get-ptr) _pointer 0) _scheme 0)
  Segmentation fault

So the segfault may still be something to do with the garbage collection clobbering the C pointers, rather than being some strange inexplicable thing wrong with Racket.

My Arc code reports many more GC's than the Racket code (probably because 'enq and 'deq are much more consing-heavy in Arc, not least because they are atomically invoked and stuff, and because my Arc GC-watcher watches the entire call to enq rather than just the calls to x-set-c[ad]r!), and I haven't gotten it to segfault in Arc yet. Arc might conceivably be immune to the segfaults, and we might be able to protect it from the errors.

By the way, I am using racket v5.0.99.7 x86_64 compiled from source on Mac OS X 10.6.6 this morning. (Ooh yeah. Conses take up 2x space, but Arc is maybe 5-10% faster.) I doubt very much that it makes a difference, though.

[1] https://gist.github.com/796949

[2] http://arclanguage.org/item?id=11347

2 points by akkartik 5273 days ago | link

3) get the Racket people to do something (create a "no-gc-here" special form, make it possible to make all pairs mutable by default, something).

Hmm, isn't this just a vanilla racket bug to be filed? GC should be transparent to the program.

-----

1 point by aw 5273 days ago | link

isn't this just a vanilla racket bug to be filed? GC should be transparent to the program.

Nope. If you're creating or manipulating Racket data structures yourself, you have to cooperate with the garbage collector to avoid confusing it. http://docs.racket-lang.org/inside/im_memoryalloc.html#(part... describes how.

I don't know how to cooperate with the GC from Racket (instead of from C, as described in the manual) while using pointers to modify Racket data structures. If it's possible, you might be able to get help on how to do it on the Racket mailing list. But it's not a Racket bug.

-----

1 point by akkartik 5273 days ago | link

I don't understand; that link is for the C API, but we're purely in racket land, right?

Edit: Oh, is this because of arc modifying scheme's immutable pairs? <slaps head> Talk about going against the grain of the platform..

-----

3 points by aw 5273 days ago | link

The issue here with GC isn't that we're modifying immutable pairs. It's that we're modifying a Racket data structure ourselves, without letting Racket do it for us, and without cooperating with the garbage collector. (If we modify a Racket data structure ourselves it doesn't matter if we do it from Racket or from C, it's the same issue). We'd have the same GC issue if we were using pointers to modify any Racket data structure, whether mutable pairs or immutable pairs or vectors or whatever.

(That's what performing "unsafe" operations means: we can cause seg faults or mess up the garbage collector. "Safe" operations are ones where we let Racket do the data structure manipulation for us, and so it's a Racket bug if there's a seg fault or GC problem).

That we're modifying Racket's immutable pairs could potentially give us a another bug though. When lists are immutable,

  (let ((x (list 1 2 3)))
    (foo x)
    (car x))

and if the compiler knows that the built-in Racket list and car are being called by "list" and "car" through the module system, the compiler could figure out that it can go ahead and return 1 for this expression without actually having to perform the car operation: the compiler can perform optimizations with immutable pairs that it can't do with mutable pairs. I haven't seen any evidence that this issue has bitten us yet, but it remains a possibility that some future implementation of Racket might perform new optimizations that would mess us up when we're modifying immutable pairs.

-----

1 point by akkartik 5273 days ago | link

Ok, makes sense. I hadn't followed that by 'data structure' you meant 'internal data structure'.

Arc's not just a userland racket program; ptr-set! and ptr-ref are low-level creatures. I hadn't focused on this fact, even though I'd seen the comments at set-ca/dr!

-----

2 points by aw 5273 days ago | link

Exactly. Back in the MzScheme 3xx days Arc was, to MzScheme, just an ordinary (if rather large) MzScheme program. And, with my runtime project (if successful), Arc will once again to Racket be just a big Racket program, so any bugs with Racket will be legitimate Racket bugs that we can go ahead and file a Racket bug report on ^_^

-----

2 points by akkartik 5273 days ago | link

This is actually a pretty big realization for me. Between the documentation thread 2 months ago (http://arclanguage.org/item?id=12860) and this realization that the queue bug is all the fault of our arc implementation, my opinion of the racket runtime is entirely rehabilitated. (http://arclanguage.org/item?id=12302 was the closest thing to a flame war I've been involved in here)

Now my only complaint with mzscheme in general is that it isn't dynamic enough, and forces us to use its module system :) But even that's just because we're using racket in this weird 'backwards compatibility' mode. I'm looking forward to ar because it'll let arc use all of racket's modern goodies (keyword args, extensible equality, ...)

Edit: I suppose we're still stuck with scheme's phase-separated macros.

-----

2 points by waterhouse 5265 days ago | link

Oh man I fixed it. http://arclanguage.org/item?id=13616

-----

1 point by aw 5273 days ago | link

Great work! Thank you!

2) make Arc be implemented with mutable pairs or some such thing

Working on it :-) https://github.com/awwx/ar

-----

2 points by aw 5273 days ago | link

P.S. I'm really, really, really happy that you've narrowed the problem down to a garbage collection issue. Implementing Arc lists with mpair's seemed to fix the queue bug, but I had no proof. I was still possible that I had merely moved data structures around in memory or whatever enough so that the bug simply wasn't manifesting by running the queue test. Now that we know that it's a GC problem, I have much more confidence that the mpair implementation is in fact one way to fix the bug.

-----

2 points by shader 5273 days ago | link

Just wondering, I haven't looked into ar that deeply yet:

How invasive is the change to mpairs really? Is it drastic enough you needed to create most of a new runtime, or is that just because you're like the rest of us and want to experiment ^^

-----

2 points by aw 5273 days ago | link

http://awwx.ws/mpair0 was my attempt to simply modify Arc 3.1's ac.scm to use mpair's. I got enough working to see that using mpair's appeared to fix the queue bug, but my implementation was otherwise quite buggy. (Lots of places where Racket lists appeared where Arc lists should have been, and vice versa; and weird mixes of the two, etc.) I have no doubt that someone cleverer than me could get it to work, but I just kept getting lost trying to figure out my bugs.

So I decided to implement the compiler incrementally, where I could test each step as I went. It would take a lot more work (a bit like getting my jeep stuck in the swamp, and so hauling myself out ten feet at a time with a winch), but it was an approach that I knew I would be successful at.

Then, since I was rolling through a compiler rewrite anyway, I decided to also go ahead and reflect the compiler into Arc to make it more hackable ^_^.

-----

1 point by evanrmurphy 5273 days ago | link

One of the 3 goals listed in the readme is "to make Arc more hackable," so I would guess that mpairs isn't most of it. Reorganizing ac.scm and adding unit tests so that the internals are easier to modify is probably a significant motivation.

Am I off-base here, aw?

-----