Arc Forum | Of course nobody actually cares anyway, but SNAP's bytecode engine now seems to ...

Arc Forum

2 points by almkglor 6281 days ago | link | parent

Of course nobody actually cares anyway, but SNAP's bytecode engine now seems to be working.

The tests/executors_test.cpp contains the test for the bytecode/executor engine. It's very basic - it just converts a list-encoded symbolcode/bytecode to a function, then executes that function.

The symbolcode calls a builtin function, '$ (which I'm reserving for implementation-specific Arc stuff). Since this is CPS, the builtin function returns by calling back to a bytecoded/interpreted function.

Assuming you have boost in a standard include, and g++, you can try it out by downloading/gitting from http://github.com/AmkG/snap/tree/master , then go to the tests/ directory and:

  ./compile executors_test.cpp
  ./a.out

IIRC the only thing I've used so far in boost are the shared_ptr and scoped_ptr, which are probably the most portable parts of boost (and don't require compiling a library).

It's tested on 8.04 Ubuntu with the versions of g++ and boost included in it (don't remember exact versions, and don't currently have access to my machine).

This will probably end up being how SNAP initializes:

  1. Construct a bytecode list for compilation by the
     executor system.  This bytecode list will be derived
     from a version of arc2c compiling itself with the
     assumption that it's safe to use macros in the same
     environment.
  2. Convert the bytecode list to a function.  Since this
     requires the executor system, which assumes that it
     normally executes interpreted code and only
     occassionally built-in CPS functions, this conversion
     will need to use some temporary global variables to
     access bits of the executor system that are normally
     accessible only to interpreted code.
  3. Execute the function.  This will end up being similar
     to how interpreted code is normally executed, except
     that we don't have to go through work queues etc.

And of course... a few more tests would be necessary for the executor system, then we have to consider how to handle a worker thread pool to execute processes, and also to disable the use of a worker thread pool if we want to use a single OS thread.

1 point by almkglor 6269 days ago | link

Update...

Currently I'm waffling on I/O. Asynchronous I/O is hard, let's go redditing...

Originally the plan was that there would be a separate builtin type for I/O ports, which would be just shared_ptr wrappers around AsyncPort objects.

Of course there's the rather minor problem though when several separate Arc processes try to access the port.

And there's also the minor problem where in Linux some (most?) filesystems assume they're so fast that they will always return "ready" for select() or poll(). Haha.

Currently I'm trying to implement ports as separate C++ processes with their own message queues.

And just suddenly now, it dawned unto me that this time there's the minor problem where I/O is performed on reads. A read might consume a variable number of characters; the problem is, which variable number of characters? How do I determine, before I even look at anything, how many characters to consume? Since I only get one queue entry, and another process might get the next entry? Take for example the 'read function: it knows it will start a list when it sees #\(, but it can't know how many characters, total, the list will be. We can't "release" the port until we actually complete the 'read function, because if we do, another process might read from the port and start reading random characters, meaning we (this process and the other proess) both end up not reading anything useful.

I'm not making sense I think.

So in the end I may very well have AsyncPort objects in the Arc-side, and when reading pass in an entire function which will consume characters, and then when the function returns a value (or throws an error), then we "know" that the port is released and another process can grab it by passing in a new function.

-----

1 point by stefano 6268 days ago | link

For the read problem, the shared I/O port should have a buffer containing the characters read, and an index on this buffer for every process that read data from it. This way every process can act as if it were the only one to read data because it will read starting from its own index. This doesn't work for writing data, of course, so output port shouldn't be shared.

-----

1 point by almkglor 6268 days ago | link

Suppose process A reads up to the second-to-the-last character. Then suppose it passes the port to process B. Process B does a read. What does it read? First character or last character? What would the programmer reasonably expect?

-----

2 points by stefano 6267 days ago | link

Both the options seem reasonable to me. If B wants to read the first character probably he would open a new port, so when B receives a port from A he should read from the last character. What if now A tries to read from the port? To keep the behavior of the virtual machine consistent (always copy) A should read the last character, and not EOF. If this didn't happen, B would have successfully changed the state of something belonging to A. In conclusion, every process should have their own index in the buffer, and when a port is passed from A to B, B gets a new index initialized with the value of A's index.

-----

1 point by almkglor 6267 days ago | link

Huh. Didn't think of that.

Anyway, the internet connection at home is down, I've got girl problems again, and I am in the mood to sleep before I start shooting everybody (especially since I suck at first player shooters, so playing those games just pisses me off). Don't anybody go around expecting anything decent for SNAP this weekend, not that there was anything decent there anyway.

-----

2 points by almkglor 6265 days ago | link

http://www.monkey.org/~provos/libevent/ Hmm. Possibly useful, we might be able to use its event_loop(). Possibly we can implement a central I/O process written in C++ which simply waits for I/O, then notifies the calling process that actually performs I/O.

As an aside, the problem with having separate buffer pointers for each process in each input port is that it stops being orthogonal to output ports, where it's simply impossible to have separate buffer pointers.

So I think I'm putting it back to "I/O ports are really processes", where the I/O port process is an Arc process that acts as a serializing mutex around the I/O port.

-----

1 point by almkglor 6262 days ago | link

Oh man, asynch I/O is hard...

As an aside, the newer (but presumably less well-developed) libev http://software.schmorp.de/pkg/libev.html supports nice timeouts and child process monitoring, which would really help in implementing 'sleep and 'system.

Further, I'm also thinking of ways of reusing continuation closures.

In arc2c many continuation closures can be eliminated because the compiler can inline global functions with impunity (simply by detecting if a global variable is assigned with a function exactly in one place, at the top-level). These remove the need to construct continuations for many cases, such as calling the basic 'car and 'cdr functions (they are inlined to the 'car and 'cdr primitives), which in most cases are defined only once, in the library.

However in SNAP we want to support full dynamism, so we can't do inlining for many cases and we must actually perform a CPS-style call, which requires constructing a continuation closure.

A continuation closure, once invoked, can usually be discarded immediately; in fact, the continuation closure's data can even be completely overwritten.

The only exception here is with 'ccc. So, what I'm thinking is, we add a new type of closure, a k-closure, which is just a subtype of standard closures. It contains an additional bool, specifying if it can be safely reused (the default).

When 'ccc is invoked, it clears that bool. In addition, it must also search through the entire existing "stack" of k-closures, clearing all their reusable flags.

A continuation function can then simply reuse its own k-closure while constructing a new continuation, unless the reusable flag is cleared.

-----

1 point by almkglor 6259 days ago | link

Okay, I've implemented the continuations idea above. I also tested it a little, and tested a beta version of the arc to bytecode compiler (that's now on the git). Testing revealed a hidden bug in the executors framework ^^ specifically the first bytecode in each function did not get executed. Since the most usual first bytecode is a check of the number of parameters, I didn't see the error ^^

I'm thinking of also implementing the continuations idea above in arc2c.

Edit: Oh, and since we're on the subject of continuations: I don't know, but the lack of a full 'dynamic-wind support in Arc seems rather, err, well, puzzling. What's supported is about half of 'dynamic-wind, i.e. the half that handles exiting the 'dynamic-wind; it's exposed as the ac.scm 'after. This means that if someone creates a generator library and does something like:

  (defgen foo (f)
    (w/infile p f
      (yield (read p))))

... it won't actually work, because once 'yield calls outside of the context of 'w/infile (via a continuation), the file is closed and can't be reopened (because 'w/infile doesn't use an "before" handler, just an "after" handler).

What I'm wondering about is: is this actually OK? If I implement just "after" handlers on continuations, is that "good enough" for Arc?

-----

1 point by stefano 6261 days ago | link

How do you safely inline functions?

Suppose you have a file with these definitions:

  (def f () 8)
  (def g () (f))

and you compile it. arc2c would then inline f when called by g. Suppose now that I load the compiled file from the repl and then type:

  (def f () 5)

now calling g from the repl should return 5, but f were inlined before, so it will return 8. Does arc2c handle this kind of problems? If it does, how is this implemented? I'm asking because I wanted to do a similar optimization in nyac, but this problem blocked me.

-----

1 point by almkglor 6261 days ago | link

Simple: arc2c doesn't have a REPL ^^

Basically, it expects all definitions to be in the file. It only inlines definitions that are:

1. Done at the top-level, including loaded and required files

2. Defined only once (i.e. assigned to only once)

Basically this is largely an optimization of the arc.arc and ac.scm libraries.

This optimization is useable only if 'eval is not ever used. If 'eval is ever used, this optimization will have to be disabled.

-----

2 points by stefano 6260 days ago | link

I wonder if there is a way to make that optimization work also when eval is used, maybe by tracking a dependecies list for every function, because it is really helpful. To make an example: the '- function takes a variable number of args, so all the arguments passed to it must be consed. With inlining it would be possible to avoid consing the arguments when they are known to be 1 or 2. To give you an idea of how much consing the args costs, take the fibonacci example: to calculate the 32nd number on NYAC when consing '- args it takes (on my computer) ~3.4 secs, without consing '- args it takes ~0.6 secs. This is a huge difference.

-----

1 point by almkglor 6258 days ago | link

> With inlining it would be possible to avoid consing the arguments when they are known to be 1 or 2.

Hmm.

Since I control the VM anyway, it may be possible for me to add some sort of feature/bytecode/built-in-function that can perform reduction of arguments without consing (i.e. just by allocating a continuation function and reusing it). It would be possible to also have it "short-circuit" so to speak, prevent it from allocating a new closure and just pass its own called continuation directly to the child function.

Basically it would be like this:

  (with (f0 (fn () (handle-0-arguments))
         f1 (fn (a) (handle-1-argument a))
         f2 (fn (a b) (handle-2-arguments a b)))
    (fn rest ; not actually expanded into a list
      ; this will be implemented in C
      (if
        (no rest)
          (f0) ; pass the continuation
        (no (cdr rest))
          (f1 (car rest))
        (no (cdr:cdr rest))
          (f2 (car rest) (cadr rest))
        ; perform reduction
          (ccc
            ; enclose these variables
            (with (a (f2 (car rest) (cadr rest))
                   rest (cdr:cdr rest))
              ; also enclose the continuation
              (afn (k)
                (zap a (f2 (car rest) (cadr rest)))
                    ; rest will be an index into the
                    ; closure variables
                (if (zap cdr rest)
                    (self k)
                    (k a)))))))))

Oh and here's a blog of SNAP: http://snapvm.blogspot.com/

-----

2 points by stefano 6257 days ago | link

So handle-n-arguments is a special form? Exposing such a special form would break compatibility with both arcN.tar and Anarki, but this seems a good solution.

> Oh and here's a blog of SNAP: http://snapvm.blogspot.com/

Nice! This way informations about SNAP will be no more spreaded in the forum. I suggest copying everything already present in the forum about SNAP into the blog.

-----

1 point by almkglor 6257 days ago | link

It won't be exposed as a special form - rather it will be exposed as a bytecode, and the symbolcode-to-bytecode compiler will be exposed as a function that the implementation specific '$ will return. This should reduce the conceptual footprint of extensions to Arc that I add strictly for efficiency, versus actual axioms.

Basically we just grab $ as a sort of dispatcher for implementation-specific stuff (much like it is used to access mzscheme in Anarki), and put anything that is more-or-less implementation-specific there.

-----

1 point by almkglor 6259 days ago | link

True; I'm thinking of something like using a dependencies list like you suggested, and doing dynamic recompilation (inlining the code) when a function call is done a particular number of times. I then keep a reference to the original code, in case some member of the dependencies list is modified.

The problem here however lies in the severely parallel nature of SNAP. If I'm recompiling code, then I either make a (process-local!) copy, or I force everything to use (comparatively slow!) locks just to read code. The better choice is the process-local copy but it has the drawback that optimizations that run on one process won't get passed to another T.T unless SNAP is run in single-worker-thread mode.

-----

2 points by shader 6280 days ago | link

I do too care. It's just that I'm not intelligent/experienced enough to understand what you're doing in a way that would allow me to help. I have very little experience with pl design, and even less with compiler/vm implementation (read, none). However, I'm very interested in how it turns out in case I could learn something and especially because I want to use it when you're finished.

So please, keep working on it, and updating us about the changes. And it's not like there's anything else going on around here for you to distract us from anyway.

-----

1 point by almkglor 6280 days ago | link

Well, this is my first time hacking together a VM anyway ^^. And arc2c is the first time I've hacked on a compiler, and I didn't even start arc2c.

Here are some interesting bits and pieces about VM's that I got via google trawl:

http://www.sidhe.org/~dan/blog/ - Squawks of the Parrot, a blog made by one of the first developers of Parrot VM. SNAP does things very differently from Parrot, partly because of its shared-nothing constraints, partly because this is a lisplike, and mostly because the Parrot VM is just something ^^.

http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ - about the squirrelfish VM. This is the main reason why I decided to use bytecode instead of AST traversal, and why I made a bit of an effort to reduce dependence on the stack (by introducing the -local-push and -clos-push variants of some bytecodes). It also has a bit about direct threading, which is portably implementable only on GNU compilers, and is the fastest portable implementation of bytecode interpreters (faster methods require assembly coding, or at least (in the case of dynamic inlining) some knowledge of what type of code can be copied without modification on what type of processor).

http://research.sun.com/self/papers/papers.html - a bunch of papers about Self, a smalltalk-like language and its implementation. I hear tell it's the most powerful VM on the planet, and its results are what is fueling Sun's recent JVM optimizations. The page itself does not contain the papers, but you can search for their titles and have a good chance of finding them online (or at least, when citeseer is up ^^)

And some more discussion:

Dynamic inlining is a sort of "poor man's JIT". Instead of generating a bytecode sequence as a vector of pointers-to-labels (as in direct threading), we directly copy the bytecode's implementation into the stream.

JIT is still the fastest way to implement an interpreter.

-----

1 point by almkglor 6280 days ago | link

Added a few more tests.

Also, the g++ version I've got is 4.2.3, and libboost is 1.34.1 . Ubuntu here is the 64-bit version, 8.04

-----