Arc Forumnew | comments | leaders | submitlogin
Proof of concept
7 points by waterhouse 3666 days ago | discuss
For those of you with x86-64 processors running either Linux or OS X, you might like to try a little proof of concept of a system of mine. First, ensure that you have the programs "racket", "gcc", "nasm", and ideally "rlwrap" available in your $PATH. Then:

  git clone https://github.com/waterhouse/emiya
  cd emiya
  sh arc.sh
Now paste in the following:

(let u (handle-and-dlsym) (ga reversing-timing9 6) (between h range.2 (prn) (gc) (withs f (fn (x y z m u v) (call-asm reversing-timing9 x y z m u v)) n (* 400 (expt 2 20)) b make-bytes.n (time:do (pr (f b n 4096 u 100000 600) " ")))))

That does the following:

  Puts RTLD_DEFAULT and a pointer to the "dlsym" function, which
    is used to load C library functions, into a byte string.
  Assembles a file called "reversing-timing9.asm" and loads the
    contents (with the head chopped off so that the executable
    code begins at byte 0) into a byte string.
  Twice does the following:
    Grabs a 400 MB byte string.
    Passes both byte strings and other arguments to an assembly
    program, which does the following:
      Sets up a two-semispace GC scheme inside the 400 MB buffer.
      Creates a list of length 100000.
      Reverses it 600 times, each time by consing up another list.
      Records the minimum and maximum time that each cons operation,
        and each reverse operation, took.
      Prints out various information at the end.
Example output from my laptop: [0]

  flips: 14 moved: 2800014 mbytes: 3200016 tbytes: 3200016 trace-ops: 2760 alloc: 4697404944
  achtung 14
  traced-count: 2800014 gc-cycles: 14 overflow-count: 234773 stacked-count: 2800014
  cons-min: 5 cons-max: 20840 rev-min: 1734397 rev-max: 3392300
  time: 1085269739 cycles: 2387593418 cycles/usec: 2199
  24 time: 1086 cpu: 1086 gc: 0 mem: 3920

  flips: 14 moved: 2800014 mbytes: 3200016 tbytes: 3200016 trace-ops: 2760 alloc: 4697404944
  achtung 14
  traced-count: 2800014 gc-cycles: 14 overflow-count: 234773 stacked-count: 2800014
  cons-min: 5 cons-max: 28711 rev-min: 1734268 rev-max: 3334688
  time: 1089901969 cycles: 2397784330 cycles/usec: 2199
  24 time: 1090 cpu: 1090 gc: 0 mem: 880
  nil
The reported times of "{cons,rev}-{min,max}" are in nanoseconds. Thus, the minimum time for a cons operation was 5 nanoseconds on both tests of the program, and the maximum time was 20 microseconds on the first test and 28 microseconds on the second. The reverse operations took roughly 1.7-3.3 milliseconds each.

Why did the cons operations sometimes take unusually long? There are two cases where that happens. One is when they trigger a GC flip; at that point, some administrative work is done, and then the stack [practically empty] and a couple of registers and global variables get traced. The other is when they trigger a spat of GC tracing work: roughly, every time we allocate 4096 bytes, we stop and trace up to 4096 * 4 = 16384 bytes of objects from the GC work list. I can tell you that the GC flip itself seems to take around 15 microseconds, while each chunk of tracing work takes perhaps 8-10 microseconds; you can deduce this by replacing the 4096 in the test case with smaller (e.g. 128) or larger (e.g. 40960) numbers.

What slows down the reverse operations? Well, there are two main things that happen. First, cons operations in the aggregate become slower after a GC flip, since they become interleaved with GC tracing work, until the work is done. Second, car and cdr become potentially slower directly after a GC flip, again until the work is done, because they may pull old pointers out of memory and have to move the objects they point to.

In this program, car always grabs an integer, and cdr always grabs a cons cell [nil is implemented as a cons cell that's full of itself]. After a GC flip, if we're at cons cell m out of a list of length n, the next n-m cdrs would hit the read barrier and have to move the pointers they grabbed--although, in practice, since the GC work list is a stack and it traces the objects that have most recently been moved, and since every cdr alternates with a cons, in fact only up to 4096/16 = 256 cdrs will hit the read barrier before GC tracing causes the rest of the list to have already been moved by the time "reverse" gets there.[1] So what we're seeing is almost entirely slowdown due to GC tracing work, but it moves the same set of conses that the read barrier would. [Actually, we'll see later that it moves twice as many.]

The apparent slowdown of "reverse" by 2x is rather misleading, though. You see, the instrumentation--diffing the value of a clock before and after each cons operation--makes the program 6.6x slower on my laptop. (Yeah, it turns out the RDTSC instruction is fairly expensive, and calling mach_absolute_time or whatnot was ~3x worse than that.) This would not be used in production! If we replace the "timed_cons" in "reverse_loop" with a plain "cons", then we get this output:

arc> (let u (handle-and-dlsym) (ga reversing-timing9 6) (between h range.2 (prn) (gc) (withs f (fn (x y z m u v) (call-asm reversing-timing9 x y z m u v)) n (* 400 (expt 2 20)) b make-bytes.n (time:do (pr (f b n 4096 u 100000 600) " ")))))

  flips: 14 moved: 2800014 mbytes: 3200016 tbytes: 3200016 trace-ops: 2760 alloc: 4697404944
  achtung 14
  traced-count: 2800014 gc-cycles: 14 overflow-count: 234773 stacked-count: 2800014
  cons-min: 5 cons-max: 85 rev-min: 215836 rev-max: 1886463
  time: 164039789 cycles: 360887536 cycles/usec: 2200
  24 time: 164 cpu: 165 gc: 0 mem: 3920

  flips: 14 moved: 2800014 mbytes: 3200016 tbytes: 3200016 trace-ops: 2760 alloc: 4697404944
  achtung 14
  traced-count: 2800014 gc-cycles: 14 overflow-count: 234773 stacked-count: 2800014
  cons-min: 5 cons-max: 90 rev-min: 216515 rev-max: 1812203
  time: 163788477 cycles: 360334644 cycles/usec: 2199
  24 time: 164 cpu: 164 gc: 0 mem: 880
  nil
Aww yeah. Note that the "cons-max" times are now miniscule, because the only remaining timed_cons operations happen at the start of the program, before any GC flips or GC work can happen. Anyway, we now find that "reverse" gets slowed down about 9x, from .216 milliseconds to perhaps 1.85 milliseconds.

A scheme for "hard" real-time GC must address the question of how much the user program's execution may be slowed down over critical lengths of time. User demands vary. Games and other applications may need to deliver a frame to the screen every 33 or 16 milliseconds; to not miss a frame, if drawing a frame takes up to k milliseconds of CPU work, then the GC system must slow down the user program by no more than an average factor of 16/k during any period of 16 milliseconds. On the other hand, for batch video encoding, all you care is that the entire runtime of the program not be slowed down "too much". An ideal browser or other tool should respond "immediately" to user input; Jakob Nielsen says 100 milliseconds is good for this[2]. Robots and other physical systems may need to respond to stimuli within some length of time (unfortunately waterhouse lacks direct experience with this).

The "real-time" performance of a system may be described with a function, f(t), mapping each time-length to the system's worst-case CPU utilization during any time-chunk of that length. This function will, for the most part, increase monotonically.[3] A stop-the-world GC system with GC pauses of length k that occur every m time-units will have f(t) = 0% for t < k, f(t) = (t-k)/t for k < t < m, and f(t) approaching (m-k)/m as t goes to infinity.

So, how "real-time" is waterhouse's system? (And how will it be in the future?) In terms of read-barrier/GC-work slowdown, I would characterize this "reverse" program as pretty close to pessimal. The work that you do is completely {cons,car,cdr}ing, and you effectively trigger a read barrier on 100% of the list you're processing.

(The alert reader might note that a GC flip happens when you're partway through reversing a list, and the "read barrier" and stuff would only get triggered on the remaining portion of the list, rather than the full list. The more alert reader will note that, in fact, the GC-work system will end up tracing the whole list (or, like, the remainder of the list plus the portion we had constructed, which will always add to the full length of the list), and due to the GC work factor of 4, it will do all of that during the current "reverse" operation unless we were in the last quarter of the list at the flip. The chance of that happening on 14 GC flips, assuming randomness, is miniscule.)

... The even more alert reader will note that, actually, we see that twice the list gets moved on each GC cycle, and this is because waterhouse had pushed the first copy of the list onto the stack to demonstrate that the GC would trace the stack. If we change this, by commenting out several lines (see reversing-timing9-onelist.asm), we get output looking like this:

  flips: 7 moved: 700007 mbytes: 1600016 tbytes: 1600016 trace-ops: 700 alloc: 13186634784
  achtung 7
  traced-count: 700007 gc-cycles: 7 overflow-count: 234769 stacked-count: 700007
  cons-min: 5 cons-max: 7516 rev-min: 247140 rev-max: 1032770
  time: 159289284 cycles: 350436440 cycles/usec: 2200
  16 time: 160 cpu: 160 gc: 0 mem: 3920
In that case, the slowdown is more like 4x.

Can we isolate the effects of the read barrier, turning off GC tracing for the most part? In fact, we can. The GC work factor is currently hardcoded as 4 by the statement "shl rcx, 2" in gc-system20.asm. We can remove it, creating a work factor of 1, or even replace it with, say, "shr rcx, 3", creating a work factor of 1/8. (But too low a work factor causes the GC to not finish tracing before the next GC flip happens.) In that case, most of the list will not get traced by the GC before the read barrier hits it, and the GC work won't slow down consing as much. With a work factor of 1/16, I tend to get a rev-max/rev-min ratio closer to 3-3.5x.

(It makes sense that the read barrier would be somewhat faster than tracing. Tracing involves more memory operations involving the GC work list, which is implemented as a stack made of the carcasses of objects that have already been copied.)

Let's talk about GC work ratios. What is reasonable? If your working set has a maximum size of x, then a GC work ratio of n ensures that you'll be done tracing the working set by the time you've allocated x/n bytes of fresh objects. In that case, your semispace will need to fit (x + x/n) = x(1+1/n) bytes, and since there are two semispaces, you'll need x(2+2/n) bytes of memory. A work factor of 4 means you need 2.5x bytes--or, equivalently, you use memory at 40% "efficiency"--the working set is 40% of the memory you need. Work factors approaching infinity yield 50% memory efficiency. A work factor of 1 yields 25% efficiency, 1/2 yields 16%, 1/4 yields 10%.

Given a certain working set and a maximum amount of memory, you will need a work ratio that is at least a certain number. It's always possible to pick a higher work factor--stop-the-world GC is equivalent to an infinite work factor--in which case you finish tracing early, after which you'll have a while to allocate objects without needing to do GC work. But, no matter what factor you pick, the required ratio is also the average ratio of tracing work to allocation work that your program must perform. With a working set of 40% of main memory, and therefore a required work factor of 4, there will be a drag on all the allocation work you do, slowing it by a factor of 4 times however much more expensive copying and tracing is than allocation. (We've seen it's perhaps 3x more expensive; this might be tweakable.) If your program mostly does matrix exponentiation in-place, the extra cost of allocation doesn't really matter. On the other hand, if your program does nothing but allocate objects, your "memory efficiency" may be 40%, but your runtime efficiency will probably be less than 10%.[4]

The earlier results that showed a 9x slowdown effectively illustrated a GC work factor of 2, while the 4x slowdown was for effectively a factor of 1. (How can doubling the extra work yield a more-than-doubling of slowdown? I don't know. It's a little more than 4 versus a little less than 9. I guess cache behavior could have something to do with it, too.) So I would suggest that a work factor of 1, with a memory efficiency of 25%, might be a reasonable option to investigate for hard real-time requirements.

Issue: The slowdowns of GC tracing (4x) and memory barriers (3-3.5x) might be additive in practice (to 6-6.5x). The fact that the GC traces depth-first from the most recently moved object tends to limit that effect; as we see here, if you're cdring down a long and flat list and doing tracing work, the GC will move the objects right before you reach them. If you were making repeated random accesses into a binary tree, though, you would probably still hit the read barrier a lot. If this is a problem, one option is to temporarily reduce the GC work factor (perhaps to 0) after a flip, while you're hitting a lot of read barriers, and later bump it up to 1.5 for a while. Another, more elegant approach is to count "moving objects" as GC work, so that if you hit a bunch of read barriers, you do less tracing in the next GC work unit--and also, if, when tracing objects, you end up having to move a bunch of objects, you trace fewer objects this unit. That should probably be done in any case--tracing an object is much cheaper when the pointers it contains don't need to be moved, and any attempt to divide GC work into equal time units should take this into account. Last, if we can use multiple threads[5], you can have another thread do GC work, and it'll probably stay far ahead of allocation for most programs.

Setting aside regularly scheduled GC work, if real-time requirements are still iffy due to just the read barriers, another possibility is to create "socially irresponsible" versions of car and cdr that don't have the usual read barrier.[6] If they pull a fromspace pointer out of memory, they won't move the object it points to (though they may check if it's already been moved). To maintain the correctness of the GC algorithm, this means we must also use a write barrier that keeps track of any stale pointers we store in memory, which must be traced later. This generates GC work, and therefore should probably not be used in general. However, it might be used (a) when you face a really tight deadline and can afford to clean the mess up later, and (b) for a few functions like "length" or nested-data-structure lookups that don't actually store any of the pointers they grab. (E.g. (list-ref xs n), as a subroutine, only needs a full read barrier on the car it grabs at the end.)

Last, let me make a little point. What are we up against here? We have a program that creates a list of length 100k and reverses it 600 times. What's the competition like?

  Racket, with a process memory usage of ~100 MB:
  > (time (let loop ((i 600) (xs (range 1 100000))) (when (> i 0) (loop (- i 1) (reverse xs)))))
  cpu time: 2333 real time: 2337 gc time: 1710

  SBCL, with a process memory usage of ~80 MB:
  * (let ((xs nil)) (dotimes (i 100000) (push i xs)) (time (dotimes (i 600) (setf xs (reverse xs)))) (car xs))

  Evaluation took:
    0.439 seconds of real time
    0.435126 seconds of total run time (0.375509 user, 0.059617 system)
    [ Run times consist of 0.098 seconds GC time, and 0.338 seconds non-GC time. ]
    99.09% CPU
    966,035,242 processor cycles
    960,002,640 bytes consed
    
  99999

  reversing-timing9, being handed 80 MB:
  flips: 31 moved: 6200031 mbytes: 3200016 tbytes: 3200016 trace-ops: 6107 alloc: 4584048144
  achtung 28
  traced-count: 6200031 gc-cycles: 31 overflow-count: 234781 stacked-count: 6200031
  cons-min: 5 cons-max: 84 rev-min: 216315 rev-max: 1929215
  time: 191296665 cycles: 420852654 cycles/usec: 2199
  24 time: 192 cpu: 192 gc: 0 mem: 880
Now, this is not necessarily fair. This is hand-written assembly, versus the output of compilers. However, I don't think I've cheated; I'm using generic car/cdr operations that do full type-checking and read barriers, even though the cars are always fixnums and the cdrs are always conses. (If you want to see how much one read barrier+typecheck costs, you can replace the "car" in reverse_loop with "cheap_car". The difference to the program's runtime is something like 8%.) I think this program could be the output of a very efficient, but not brilliant, compiler. And Racket and SBCL don't need a read barrier, although I believe they both use generational GCs and therefore have a write barrier.

Nevertheless, let me just say that a 4x slowdown means we're about twice as slow as SBCL, and 3 times faster than Racket. Bahahahahahahahaha. (If you subtract out Racket's GC time--which is most of its execution time--then we end up 33% slower than it.) Also, I don't think the Racket or SBCL compilers generally allocate unnecessary Lisp objects [although Racket's conses are 8 CPU words]--their slowdown compared to this hand-written assembly probably comes from comparative inefficiency of the underlying operations, similar to how the instrumentation slowed down "cons" earlier. (A quick glance at SBCL's (disassemble 'reverse) is confusing, but might support this notion.) In that case, for example, if we had SBCL's compiler and the program took 350 msec, then the "4x slowdown" would still add ~600 msec (4x means +3x of the original ~200 msec), and would become less than a 3x slowdown. If we had half of SBCL's speed and took 700 msec, then the slowdown would bring us to 1300 msec, which would still put us at nearly twice Racket's performance; even a relatively pessimistic "6.5x slowdown" would put us at 1800 msec, still faster than Racket. Recall also that this is an estimate of worst case behavior.

Anyway, I would suggest that a scheme for a two-semispace copying real-time GC in the style of Henry Baker is likely practical for many real-time applications, and is also appropriate for non-real-time applications. (For non-real-time applications, the biggest cost during general program execution is unnecessary read barriers; anecdotally I can tell you it's about as pricey as a type-check, both of which cost about 25% speed on a fairly pessimal program that just adds up integers from a list. This difference is probably swamped by other differences among Lisp implementations, and at any rate it wouldn't be difficult to add a compiler flag to turn off the read barriers and make the GC-flip trace everything.)

The concept will be more fully proven when other core data structures--vectors, closures, hash tables--have working implementations. This is on waterhouse's to-do list.

[0] All timing data comes from waterhouse's laptop, a mid-2011 MacBook Pro with a 2.2 GHz Intel Core i7 processor.

[1] Actually, since the ALLOCPTR gets bumped up by both fresh conses and moved conses, and these will alternate, we might see at most 128 read barrier hits. Come to think of it, it wouldn't be hard at all to count read barrier hits. But not now.

[2] http://www.nngroup.com/articles/response-times-3-important-limits/

[3] Counterexamples are possible. Let's imagine a hypothetical system that lets the user program work for a length of time t, then pauses it for time t, and repeats indefinitely. In that case, f(2t) = 50% (the user gets 50% CPU usage during any period of 2t time-units), but f(3t) = 33% (choose a period beginning at the start of a pause). However, it is a mathematical fact that f(t) ≤ f(nt) for any integer n: if there is any period nt during which the user had the CPU x% of the time, then we can cut it into n sub-periods of length t; x% being the average CPU utilization over all sub-periods, the least CPU utilization among the individual sub-periods will be less than or equal to the average (with equality only if the sub-periods are all equivalent), so there must be at least one t-period with utilization ≤ x%.

Also, f(k) ≥ f(t) * (floor(k/t)/(k/t)) for any k and t. (This comes from dividing k into chunks of size t, noting that each chunk must give t*f(t) units of CPU time, and assuming the remainder yields zero CPU time. This worst case can happen if the system spends most of the time doing nothing but delivers "over 9000" CPU usage for a very short time every t units.) Thus, f(k) > 1/2 f(t) for k > t (worst case at k=1.999...), f(k) > 2/3 f(t) for k > 2t, f(k) > 9/10 f(t) for k > 9t, and so on.

[4] Generational GC might help with that; it would mean introducing a write barrier, which would slow down the program overall, but might be worth it.

[5] Yes, waterhouse has given multithreading a great deal of thought. The short story is that most of it can be handled just fine with "compare and swap" or equivalent. A somewhat sticky issue is with handling long bytevectors: due to the ABA problem, on x86 processors, it may be necessary to use a lock if you want to write to the portion of a bytevector that hasn't been moved from fromspace yet. Long vectors are fine; you can leave a unique tombstone value to be sure, though this means writes to the old half of a vector will need to be compare-and-swaps, and reads may need to wait a bit, which is somewhat annoying. Also, the "GC work list is a stack made from the corpses of old objects" paradigm might need tweaking.

[6] This idea comes from one P.T. Withington, in "How Real is Real-Time GC?". Also, of course, the main GC scheme is adapted from Henry Baker's 1978 proposal, "List Processing in Real Time on a Serial Computer". The scheme for figuring out what registers to save at a GC flip, by looking through a table whose keys are return addresses, is adapted from a technique in "Atomic Heap Transactions and Fine-Grain Interrupts" by Shivers, Clark, and McGrath.