Arc Forum | SNAP: Shared Nothing Arc Processes the VM discussion

Arc Forum

SNAP: Shared Nothing Arc Processes the VM discussion

8 points by almkglor 6139 days ago | 52 comments

As requested by shader, move all your discussions about the new shared-nothing VM by almkglor, who hasn't even bothered to show a single bit of code.

Problems to discuss:

1. Message passing and memory management. Copy always or share-and-copy-on-write? Or something even weirder?

2. Global variables. Yes, you can treat global variables as "just another process"... but then you get the overhead of receiving a new "message" each time you read a global variable. Which is, conceptually, a copy, and might need to be done in that way in the VM at some point. Me, I say have each process maintain a "cache" of global variables, and if the process mutates a structure in a global variable (but not if it mutates the global variable itself), then the behavior is just undefined: maybe when the process reads the global variable again, it will get the mutated version, maybe not, maybe the computer will suddenly burn, maybe not... if the process mutates the global variable itself, it will lock out the other processes while changing the global variable.

1 point by almkglor 6139 days ago | link

From: http://arclanguage.com/item?id=7124

> you could use a Bloom filter to determine if the object is present on the other process.

Hmm. I did think of using Bloom filters, but then I would also end up "copying unnecessarily" anyway for some N% of false positives. Not to mention that I'd still have to build k different hash functions for pointers. So I decided, darn it, just copy all.

> How much are you implementing on the c++ side?

Memory management, worker threads (processes are green but are distributed among a set of worker threads), async I/O (technically Arc-side won't get async I/O, just sync, but then the Arc-side can launch a process to do the actual I/O and send a message to the original process that the I/O's done.) Async I/O is always a problem (even in arc2c), it doesn't seem to be very portable.

Might do a very very basic reader in C, maybe just enough to read numbers, symbols, and lists of same. I want to do the actual reader in Arc and use raymyers' treeparse ^^.

Also, I have to design the bytecode.

The arc2c "machine" is really a simple pushdown stack machine, except that function calls will simply reset the stack to the top N entries (where N is the number of arguments to the function call, including the function/object to call and the continuation).

My intent is to have the external representation of the bytecode use a stack machine similar to the arc2c stack machine. Internally, if the implementation can, it is free to recompile the bytecode to a register bytecode, which I hear is slightly faster because the number of instructions is reduced. It is free to use superinstructions, inline (but only if it's sure it can inline - in particular code that references, say, the 'car global function, should be able to adjust if 'car is redefined to, say, work on scanners), etc etc.

-----

2 points by shader 6139 days ago | link

Concerning bloom filters, they are only helpful if knowing the fact that a particular process has a reference matters when deciding to copy. So, it's useful for the case when you're sending an object to a process. Otherwise ref counting is better, because you aren't broadcasting "do you have this?" messages.

Bloom filters are probably only a good idea if your copy is a large one, rather than just a small variable.

Here's an interesting idea: for extremely large messages, have some sort of "lazy copying" where the process only copies pieces just before it uses them, or when there isn't a cost associated with the transfer. That way large copies don't slow down your system. You'd still increment the reference counter, if you were using one, and any mutations would have to change a different copy.

Actually, a good question is what the relative costs of sharing vs. copying are. I would think that for small things copying is less expensive than keeping track of who has it, etc., just to avoid copying. For bigger items, it might be worth avoiding a premature copy. I don't know.

Edit 1: About the k hash functions, I think that if there is low enough correlation between bytes in the result you can break up one hash function into several. Also, most good hash functions, (the ones that might be too slow for this purpose) have enough of a difference in the result with slight changes to the input that you could add just 1 to the value several times, and get a dozen hashes. So you could use:

  (hash val)
  (hash (+ val 1))
  (hash (+ val 2))
  ...

and get "3" hash functions. If you break each hash into bytes as well, and each produced multiple bytes, you could easily have 6, 9, or 27 "hash functions"

-----

2 points by almkglor 6139 days ago | link

> So, it's useful for the case when you're sending an object to a process.

Yes, but it is in fact the only case where we copy objects anyway (even reading a global variable would be effectively sending the object in the global to the process in question)

> for extremely large messages...

  #define N    I don't know
  if(msg.total_size() > N){
    msg.share(process);
  } else {
    msg.copy(process);
  }

I think this will end up having twice the overhead though: you need mechanisms to support both sharing with copy-on-write-and-if-receiver-already-has-it and copying.

Take note also that sending messages is recursive: we might be sending a list and thus end up sending several items.

> I would think that for small things copying is less expensive than keeping track of who has it, etc., just to avoid copying.

This was the justification in the Erlang BEAM and JAM machines to always copy.

> About the k hash functions...

Cool, I didn't think of that!

-----

2 points by shader 6139 days ago | link

So then, should we just always copy?

Unless we make some special immutable types as mentioned above; then we could share those. Maybe I'm just prematurely optimizing, and should just give up on trying to find ways to avoid the copy. Life is much simpler just copying. :)

How about just starting with always-copy, and later when we have performance data, and experience using it, we can code exceptions? I don't know how easy it would be to code such exceptions, but it would be less likely to break if we didn't make these decisions without empirical evidence.

>twice the overhead... I didn't think it would create any extra mechanisms, just a short circuit rule when deciding to copy or not. Rather than checking the ref counts each time the item is mutated or sent, if it is smaller than a certain amount, we know it was copied.

Now, maybe ref counts are faster than total_size(), but I don't know.

Certainly, always copying would avoid this whole overhead business. :)

-----

4 points by almkglor 6139 days ago | link

Okay, here's a github public repo:

http://github.com/AmkG/snap/tree/master

-----

2 points by shader 6135 days ago | link

How's SNAP coming?

Would you mind too terribly much explaining/commenting the code so that I could follow the thought process? I'd like to help, as erlang mixed with lisp just sounds awesome, but I'm not the best with c++, much less c++ I don't understand. ;)

-----

2 points by almkglor 6134 days ago | link

Nicely enough, although I haven't managed to push some of my more recent changes on the github (for some inexplicable reason our internet connection is very slow...).

There are quite a few comments in the code but no real top-level description of the code yet. Here's a first attempt:

The basic class structure for memory management is:

  Generic (abstract/non-instantiatable base class)
   +- .... all Arc objects

  Heap (abstract/non-instantiatable base class)
  +-Globals (concrete)
  +-Process (concrete)

  Semispace

  Heap contains a pointer to Semispace, "s"
  Heap contains a vector of Semispace pointers, "other_spaces"
  Heap has the following dynamic functions:
    get_root_set()
      - gets the root set of this heap, i.e. if it's a
        Process, the process's stack and cache of globals

  A Semispace is defined as a memory area containing 0 or
  more Generic objects.  Each Generic may have 0 or more
  references to other Generic objects.

  A Semispace may be held by only one Heap.

  Each Generic may refer only to Generic objects in the
  same Semispace, or to another Semispace held by the
  Heap, i.e. can only refer to Generic objects in the same
  Heap.

  When a Heap wants to transfer an object to another Heap,
  it must copy all those objects into a new Semispace
  (one which it, conceptually, does not "hold" except
  during initialization), then adds it to the destination
  Heap's "other_spaces" vector (obviously locking is
  required).  While holding the lock it must also add
  the copied memory to the mailbox or wherever in the
  destination's root set it can be held.

  During a GC, the Heap holds its lock, then computes
  a "good" size for a new Semispace, creates it, then
  copies the root set to the new Semispace.  Then it
  releases all its Semispace objects (in "s" and
  "other_spaces")

So, err. that's memory management so far. ^^

-----

1 point by almkglor 6134 days ago | link

Here's a further bit, about ToPointerLock:

  ToPointerLock is a RAII class which protects the
  to-pointers of objects.

  The to-pointers are used for tracing and copying.

  Each Generic includes a to-pointer.  This is private to
  the Generic class, and is not accessible except via
  a ToPointerLock.

  During copying, the copy routine gets a reference
  from the to-copy set.  Then it queries the ToPointerLock's
  to() function on the reference.  If it returns a
  non-NULL pointer, the reference is changed to that
  value.  Otherwise, the copy routine clones the object
  to the destination Semispace, then sets the to-pointer
  via the ToPointerLock's pointto() function, and then
  changes the reference.

  ToPointerLock keeps track of modified to-pointers.  If
  something throws and the ToPointerLock goes out of scope,
  it will reset the to-pointers back to NULL (so that
  the Heap will remain valid).

  The ToPointerLock can be told to forget about resetting
  the to-pointers back to NULL, for example after a
  successful GC, where the memory areas will be reclaimed
  anyway and the to-pointer values don't matter.  This is
  done by calling the good() function.

-----

2 points by almkglor 6125 days ago | link

Here's a widdle bit more, about Executor's, BytecodeSequence's, and Bytecode's:

  An Arc function is represented by the Generic-derived
  Closure class.  The Closure class is composed of a
  function variable and a vector of Generic pointers.

  The Closure's function variable is a pointer to an
  Executor.  An Executor is usually just a wrapper around
  an _executor_label variable, but a derived class called
  ArcExecutor includes a BytecodeSequence as well.

  The implementation of the _executor_label type will
  depend on whether you are using GCC or not.  If you're
  using GCC, then _executor_label is a void* and will be
  a pointer to a lebel in the source code.  Otherwise it
  will be an enumeration, which will be used in a switch
  statement for standard C.

  (GCC supports an extension to C / C++ where the pointer
  to a source code label can be taken using &&label.  In
  standard C / C++ this is not allowed.  This is used to
  speed up interpretation so we don't have to go through
  a switch statement, insted we just use indirect jumps)

  A BytecodeSequence is just a sequence of Bytecode's (the
  exact format is not yet well defined - might be a linked
  list for simplicity, or might do some hackery to put
  them into a vector).  Most Bytecode's are (like
  Executor's) just a wrapper around a _bytecode_label type,
  which (like _executor_label) will be either a void* or an
  enum depending on your C++ compiler.

  However there are two additional variants of Bytecode.
  One is the Bytecode with BytecodeSequence variant, called
  a SeqBytecode.  A SeqBytecode is used to represent an
  inner function prior to creation from a closure:

    arc2c AST:
    (%closure (fn (self k) (k (%closure-ref self 0)))
              x)
    SNAP symbolcode:
    ( (fn (check-vars 2)
          (local 1 #|k|#)
          (local 0#|self|#) (number 0) closure-ref
          (apply 2))
      (local 2 #|x|#)
      closure
    )

  It's also used to represent an if:

    arc2c AST:
    (if x
        (k 0)
        (k 1))
    SNAP symbolcode:
    ( (local 2 #|x|#)
      ; the if contains only the then branch
      ; when flow of control enters the if, it won't
      ; leave the branch!
      (if
         (local 1 #|k|#)
         (number 0)
         (apply 2)
      );else
      (local 1 #|k|#)
      (number 1)
      (apply 2)
      )

  Another kind of Bytecode is the IntBytecode, which are
  used to represent symbolcodes with parameters, such as
  local and apply above (in fact a majority of Bytecode's
  will be IntBytecode's)

  I might also need bytecodes for strings, although I
  might just insert a step in the arc2c compilation
  code to transform them (as well as quoted lists) into
  the following code:

    (def foo ()
      (pr "foo"))
    =>
    (let quoted (string (cons #\f (cons #\o (cons #\o nil))))
      (def foo ()
        (pr quoted)))

  Character literals would probably also be transformed
  to IntBytecode's where the int is the character's
  unicode.

  Then there are the symbol literals ^^, which I haven't
  thought of yet.

-----

1 point by almkglor 6123 days ago | link

Okay, I've been (re)thinking about bytecodes a bit.

The current arc2c code generator essentially outputs a stack-based "bytecode", with the bytecodes being directly output as C macros.

However I've been reading some good things about register-based bytecodes recently. In essence register-based bytecodes have the advantage of describing more operations per bytecode, which reduces the number of dispatches in a function - i.e. we have fewer but longer instructions, unlike a stack-based bytecode system which has simpler but more instructions.

Now from my analysis of arc2c, in the final AST output (just prior to code generation) each function's body is composed of just one thing. It can either be a function call, or an if.

An if statement would then have either a variable reference or a primitive call in its condition, and its then and else branches are either function calls or if statements of the same type.

The component sub-expressions of a function call would then be either variable references or a primitive call (function literals at this time have been converted to primitive calls to %closure, and any nested function calls will have been transformed away by CPS conversion).

So it seems that pretty much everything can be transformed to variable references.

Hmm...

Edit: Waa!! I'm being forced to study MFC against my will!! LOL

-----

1 point by almkglor 6121 days ago | link

If anyone at all's still interested, I've pushed some notes on the bytecode system. I've also started to adapt bits of the arc2c compiler (a few changes were necessary in the final cps and closure-conversion changes) and creating the bytecode generator (unpushed though).

There's an untested symbolcode-to-bytecode converter in the executors source of SNAP too.

The VM is a somewhat register, mostly stack machine now. A few bytecodes were modified/added to allow direct access from a local variable instead of just modifying the stack top. So a sequence {(local N) bytecode} becomes {(bytecode-local-push N)}, removing one bytecode dispatch. It still pushes on the stack top though, but then given that there are relatively few "addressing modes" so to speak, it's not bad.

I suspect it will make JIT, if ever, a bit worse though.

-----

2 points by almkglor 6115 days ago | link

Of course nobody actually cares anyway, but SNAP's bytecode engine now seems to be working.

The tests/executors_test.cpp contains the test for the bytecode/executor engine. It's very basic - it just converts a list-encoded symbolcode/bytecode to a function, then executes that function.

The symbolcode calls a builtin function, '$ (which I'm reserving for implementation-specific Arc stuff). Since this is CPS, the builtin function returns by calling back to a bytecoded/interpreted function.

Assuming you have boost in a standard include, and g++, you can try it out by downloading/gitting from http://github.com/AmkG/snap/tree/master , then go to the tests/ directory and:

  ./compile executors_test.cpp
  ./a.out

IIRC the only thing I've used so far in boost are the shared_ptr and scoped_ptr, which are probably the most portable parts of boost (and don't require compiling a library).

It's tested on 8.04 Ubuntu with the versions of g++ and boost included in it (don't remember exact versions, and don't currently have access to my machine).

This will probably end up being how SNAP initializes:

  1. Construct a bytecode list for compilation by the
     executor system.  This bytecode list will be derived
     from a version of arc2c compiling itself with the
     assumption that it's safe to use macros in the same
     environment.
  2. Convert the bytecode list to a function.  Since this
     requires the executor system, which assumes that it
     normally executes interpreted code and only
     occassionally built-in CPS functions, this conversion
     will need to use some temporary global variables to
     access bits of the executor system that are normally
     accessible only to interpreted code.
  3. Execute the function.  This will end up being similar
     to how interpreted code is normally executed, except
     that we don't have to go through work queues etc.

And of course... a few more tests would be necessary for the executor system, then we have to consider how to handle a worker thread pool to execute processes, and also to disable the use of a worker thread pool if we want to use a single OS thread.

-----

1 point by almkglor 6103 days ago | link

Update...

Currently I'm waffling on I/O. Asynchronous I/O is hard, let's go redditing...

Originally the plan was that there would be a separate builtin type for I/O ports, which would be just shared_ptr wrappers around AsyncPort objects.

Of course there's the rather minor problem though when several separate Arc processes try to access the port.

And there's also the minor problem where in Linux some (most?) filesystems assume they're so fast that they will always return "ready" for select() or poll(). Haha.

Currently I'm trying to implement ports as separate C++ processes with their own message queues.

And just suddenly now, it dawned unto me that this time there's the minor problem where I/O is performed on reads. A read might consume a variable number of characters; the problem is, which variable number of characters? How do I determine, before I even look at anything, how many characters to consume? Since I only get one queue entry, and another process might get the next entry? Take for example the 'read function: it knows it will start a list when it sees #\(, but it can't know how many characters, total, the list will be. We can't "release" the port until we actually complete the 'read function, because if we do, another process might read from the port and start reading random characters, meaning we (this process and the other proess) both end up not reading anything useful.

I'm not making sense I think.

So in the end I may very well have AsyncPort objects in the Arc-side, and when reading pass in an entire function which will consume characters, and then when the function returns a value (or throws an error), then we "know" that the port is released and another process can grab it by passing in a new function.

-----

1 point by stefano 6103 days ago | link

For the read problem, the shared I/O port should have a buffer containing the characters read, and an index on this buffer for every process that read data from it. This way every process can act as if it were the only one to read data because it will read starting from its own index. This doesn't work for writing data, of course, so output port shouldn't be shared.

-----

1 point by almkglor 6103 days ago | link

Suppose process A reads up to the second-to-the-last character. Then suppose it passes the port to process B. Process B does a read. What does it read? First character or last character? What would the programmer reasonably expect?

-----

2 points by stefano 6102 days ago | link

Both the options seem reasonable to me. If B wants to read the first character probably he would open a new port, so when B receives a port from A he should read from the last character. What if now A tries to read from the port? To keep the behavior of the virtual machine consistent (always copy) A should read the last character, and not EOF. If this didn't happen, B would have successfully changed the state of something belonging to A. In conclusion, every process should have their own index in the buffer, and when a port is passed from A to B, B gets a new index initialized with the value of A's index.

-----

1 point by almkglor 6101 days ago | link

Huh. Didn't think of that.

Anyway, the internet connection at home is down, I've got girl problems again, and I am in the mood to sleep before I start shooting everybody (especially since I suck at first player shooters, so playing those games just pisses me off). Don't anybody go around expecting anything decent for SNAP this weekend, not that there was anything decent there anyway.

-----

2 points by almkglor 6099 days ago | link

http://www.monkey.org/~provos/libevent/ Hmm. Possibly useful, we might be able to use its event_loop(). Possibly we can implement a central I/O process written in C++ which simply waits for I/O, then notifies the calling process that actually performs I/O.

As an aside, the problem with having separate buffer pointers for each process in each input port is that it stops being orthogonal to output ports, where it's simply impossible to have separate buffer pointers.

So I think I'm putting it back to "I/O ports are really processes", where the I/O port process is an Arc process that acts as a serializing mutex around the I/O port.

-----

1 point by almkglor 6096 days ago | link

Oh man, asynch I/O is hard...

As an aside, the newer (but presumably less well-developed) libev http://software.schmorp.de/pkg/libev.html supports nice timeouts and child process monitoring, which would really help in implementing 'sleep and 'system.

Further, I'm also thinking of ways of reusing continuation closures.

In arc2c many continuation closures can be eliminated because the compiler can inline global functions with impunity (simply by detecting if a global variable is assigned with a function exactly in one place, at the top-level). These remove the need to construct continuations for many cases, such as calling the basic 'car and 'cdr functions (they are inlined to the 'car and 'cdr primitives), which in most cases are defined only once, in the library.

However in SNAP we want to support full dynamism, so we can't do inlining for many cases and we must actually perform a CPS-style call, which requires constructing a continuation closure.

A continuation closure, once invoked, can usually be discarded immediately; in fact, the continuation closure's data can even be completely overwritten.

The only exception here is with 'ccc. So, what I'm thinking is, we add a new type of closure, a k-closure, which is just a subtype of standard closures. It contains an additional bool, specifying if it can be safely reused (the default).

When 'ccc is invoked, it clears that bool. In addition, it must also search through the entire existing "stack" of k-closures, clearing all their reusable flags.

A continuation function can then simply reuse its own k-closure while constructing a new continuation, unless the reusable flag is cleared.

-----

1 point by almkglor 6093 days ago | link

Okay, I've implemented the continuations idea above. I also tested it a little, and tested a beta version of the arc to bytecode compiler (that's now on the git). Testing revealed a hidden bug in the executors framework ^^ specifically the first bytecode in each function did not get executed. Since the most usual first bytecode is a check of the number of parameters, I didn't see the error ^^

I'm thinking of also implementing the continuations idea above in arc2c.

Edit: Oh, and since we're on the subject of continuations: I don't know, but the lack of a full 'dynamic-wind support in Arc seems rather, err, well, puzzling. What's supported is about half of 'dynamic-wind, i.e. the half that handles exiting the 'dynamic-wind; it's exposed as the ac.scm 'after. This means that if someone creates a generator library and does something like:

  (defgen foo (f)
    (w/infile p f
      (yield (read p))))

... it won't actually work, because once 'yield calls outside of the context of 'w/infile (via a continuation), the file is closed and can't be reopened (because 'w/infile doesn't use an "before" handler, just an "after" handler).

What I'm wondering about is: is this actually OK? If I implement just "after" handlers on continuations, is that "good enough" for Arc?

-----

1 point by stefano 6096 days ago | link

How do you safely inline functions?

Suppose you have a file with these definitions:

  (def f () 8)
  (def g () (f))

and you compile it. arc2c would then inline f when called by g. Suppose now that I load the compiled file from the repl and then type:

  (def f () 5)

now calling g from the repl should return 5, but f were inlined before, so it will return 8. Does arc2c handle this kind of problems? If it does, how is this implemented? I'm asking because I wanted to do a similar optimization in nyac, but this problem blocked me.

-----

1 point by almkglor 6096 days ago | link

Simple: arc2c doesn't have a REPL ^^

Basically, it expects all definitions to be in the file. It only inlines definitions that are:

1. Done at the top-level, including loaded and required files

2. Defined only once (i.e. assigned to only once)

Basically this is largely an optimization of the arc.arc and ac.scm libraries.

This optimization is useable only if 'eval is not ever used. If 'eval is ever used, this optimization will have to be disabled.

-----

2 points by stefano 6095 days ago | link

I wonder if there is a way to make that optimization work also when eval is used, maybe by tracking a dependecies list for every function, because it is really helpful. To make an example: the '- function takes a variable number of args, so all the arguments passed to it must be consed. With inlining it would be possible to avoid consing the arguments when they are known to be 1 or 2. To give you an idea of how much consing the args costs, take the fibonacci example: to calculate the 32nd number on NYAC when consing '- args it takes (on my computer) ~3.4 secs, without consing '- args it takes ~0.6 secs. This is a huge difference.

-----

1 point by almkglor 6093 days ago | link

> With inlining it would be possible to avoid consing the arguments when they are known to be 1 or 2.

Hmm.

Since I control the VM anyway, it may be possible for me to add some sort of feature/bytecode/built-in-function that can perform reduction of arguments without consing (i.e. just by allocating a continuation function and reusing it). It would be possible to also have it "short-circuit" so to speak, prevent it from allocating a new closure and just pass its own called continuation directly to the child function.

Basically it would be like this:

  (with (f0 (fn () (handle-0-arguments))
         f1 (fn (a) (handle-1-argument a))
         f2 (fn (a b) (handle-2-arguments a b)))
    (fn rest ; not actually expanded into a list
      ; this will be implemented in C
      (if
        (no rest)
          (f0) ; pass the continuation
        (no (cdr rest))
          (f1 (car rest))
        (no (cdr:cdr rest))
          (f2 (car rest) (cadr rest))
        ; perform reduction
          (ccc
            ; enclose these variables
            (with (a (f2 (car rest) (cadr rest))
                   rest (cdr:cdr rest))
              ; also enclose the continuation
              (afn (k)
                (zap a (f2 (car rest) (cadr rest)))
                    ; rest will be an index into the
                    ; closure variables
                (if (zap cdr rest)
                    (self k)
                    (k a)))))))))

Oh and here's a blog of SNAP: http://snapvm.blogspot.com/

-----

2 points by stefano 6092 days ago | link

So handle-n-arguments is a special form? Exposing such a special form would break compatibility with both arcN.tar and Anarki, but this seems a good solution.

> Oh and here's a blog of SNAP: http://snapvm.blogspot.com/

Nice! This way informations about SNAP will be no more spreaded in the forum. I suggest copying everything already present in the forum about SNAP into the blog.

-----

1 point by almkglor 6092 days ago | link

It won't be exposed as a special form - rather it will be exposed as a bytecode, and the symbolcode-to-bytecode compiler will be exposed as a function that the implementation specific '$ will return. This should reduce the conceptual footprint of extensions to Arc that I add strictly for efficiency, versus actual axioms.

Basically we just grab $ as a sort of dispatcher for implementation-specific stuff (much like it is used to access mzscheme in Anarki), and put anything that is more-or-less implementation-specific there.

-----

1 point by almkglor 6093 days ago | link

True; I'm thinking of something like using a dependencies list like you suggested, and doing dynamic recompilation (inlining the code) when a function call is done a particular number of times. I then keep a reference to the original code, in case some member of the dependencies list is modified.

The problem here however lies in the severely parallel nature of SNAP. If I'm recompiling code, then I either make a (process-local!) copy, or I force everything to use (comparatively slow!) locks just to read code. The better choice is the process-local copy but it has the drawback that optimizations that run on one process won't get passed to another T.T unless SNAP is run in single-worker-thread mode.

-----

2 points by shader 6114 days ago | link

I do too care. It's just that I'm not intelligent/experienced enough to understand what you're doing in a way that would allow me to help. I have very little experience with pl design, and even less with compiler/vm implementation (read, none). However, I'm very interested in how it turns out in case I could learn something and especially because I want to use it when you're finished.

So please, keep working on it, and updating us about the changes. And it's not like there's anything else going on around here for you to distract us from anyway.

-----

1 point by almkglor 6114 days ago | link

Well, this is my first time hacking together a VM anyway ^^. And arc2c is the first time I've hacked on a compiler, and I didn't even start arc2c.

Here are some interesting bits and pieces about VM's that I got via google trawl:

http://www.sidhe.org/~dan/blog/ - Squawks of the Parrot, a blog made by one of the first developers of Parrot VM. SNAP does things very differently from Parrot, partly because of its shared-nothing constraints, partly because this is a lisplike, and mostly because the Parrot VM is just something ^^.

http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ - about the squirrelfish VM. This is the main reason why I decided to use bytecode instead of AST traversal, and why I made a bit of an effort to reduce dependence on the stack (by introducing the -local-push and -clos-push variants of some bytecodes). It also has a bit about direct threading, which is portably implementable only on GNU compilers, and is the fastest portable implementation of bytecode interpreters (faster methods require assembly coding, or at least (in the case of dynamic inlining) some knowledge of what type of code can be copied without modification on what type of processor).

http://research.sun.com/self/papers/papers.html - a bunch of papers about Self, a smalltalk-like language and its implementation. I hear tell it's the most powerful VM on the planet, and its results are what is fueling Sun's recent JVM optimizations. The page itself does not contain the papers, but you can search for their titles and have a good chance of finding them online (or at least, when citeseer is up ^^)

And some more discussion:

Dynamic inlining is a sort of "poor man's JIT". Instead of generating a bytecode sequence as a vector of pointers-to-labels (as in direct threading), we directly copy the bytecode's implementation into the stream.

JIT is still the fastest way to implement an interpreter.

-----

1 point by almkglor 6115 days ago | link

Added a few more tests.

Also, the g++ version I've got is 4.2.3, and libboost is 1.34.1 . Ubuntu here is the 64-bit version, 8.04

-----

2 points by cchooper 6139 days ago | link

By globals variables, do you mean regular globals? I would have thought these should be private to each process. After all, a process shouldn't mess with another process's state. It's rude!

-----

4 points by almkglor 6139 days ago | link

'car, 'cdr, 'list .... global variables all. Private copies for each process? All of arc.arc in private copies for each process?

How about when the user, say, loads lib/scanner.arc and then expects that existing running processes are now able to handle scanners?

There's a reason I suggested using some sort of cache for these.

-----

2 points by shader 6139 days ago | link

If they're immutable, then great. But using shared memory to transfer state is flawed, and can cause race conditions, etc. That's why we're using the actor model to begin with, right?

I suppose they can't really be immutable, or we couldn't do hot code loading. How does Erlang do it?

-----

1 point by almkglor 6138 days ago | link

> But using shared memory to transfer state is flawed,

cough let the programmer second-guess you cough

> How does Erlang do it?

It doesn't, actually. Functions are still effectively global variables, ones which you can't write to except via the c() BIF.

-----

4 points by shader 6138 days ago | link

>cough...

Heh, you're right. So, in that mindset, why not give a lot of nice macros for controlling share vs copy, and make the default be copy? Then programmers could control nearly everything. Of course, they could always hack on your vm if they really wanted tons of control.

But still, concurrent writing to a global variable sounds dangerous.

I kind of like the idea of them being "registered processes." I'll have to do some more thinking on that.

>It doesn't, actually...

Yes, that answers some of the question, but I was a bit more interested in how they implemented their hot code loading. The code still exists for a while as the existing processes continue to use it. But they eventually phase out the functions and swap to the new ones.

IMHO, hot code loading is a very nifty feature. Combined with remote REPL makes it especially useful. I don't know how well current lisps support hot swap, but I don't think it can work effectively without a concurrent system.

-----

3 points by almkglor 6138 days ago | link

> Yes, that answers some of the question, but I was a bit more interested in how they implemented their hot code loading

It's in the OTP library actually. For example, they have a standard gen_server module. The gen_server would look approximately like this in snap:

  (def gen-server (fun state)
    (<==
      ('request pid tag param)
        (let (state . response) (fun state param)
          (==> pid (list 'response tag response))
          (gen-server fun state))
      ; hot code swapping!!
      ('upgrade new-fun)
        (gen-server new-fun state)
      ('stop)
        t))

So yes: hot swapping just means sending in a message with the new code ^^

It's actually more complex than that - they generally make messages include a version number so that nodes with new versions can communicate to nodes with older versions. This versioning is, in fact, part and parcel of the gen_server series of functions. Requests to servers are made via functions which send a message (with tags and other metadata such as versions abstracted away) and wait for a receive.

I think what they say is, the programmers good at concurrency write the gen_server and other parts of the OTP, while the average programmers write application code ^^

Much of Erlang isn't implemented in the VM level ^^

-----

3 points by shader 6138 days ago | link

It makes sense that they wouldn't do that at the vm level. Your code even makes sense, though I thought "let" only assigned one variable.

I'm still not quite able to read arc fluently, so any explanations of the subtle that I likely missed will always be appreciated. Come to think of it, any explanations of any code would be nice, as the thoughts and reasons behind code don't always come out in the source itself. And I also like learning new things :)

-----

2 points by almkglor 6138 days ago | link

  (def gen-server (fun state)
    (<==

I'm using pattern matching. Although Arc doesn't actually have pattern-matching built in, someone wrote a pattern matching library a long time ago using macros: http://arclanguage.com/item?id=2556 http://arclanguage.org/item?id=1825 . The modern evolution uses something like p-m:def to define a pattern-matching function, p-m:fn to create an anonymous pattern-matching function, etc.

      ('request pid tag param)

The pattern above means "match a 4-element list, whose first element is the symbol 'request, and which has 3 more elements that we'll call pid, tag, and param".

        (let (state . response) (fun state param)

This is a destructuring. It simply means that (fun state param) should return a cons cell, with the 'car of the cell being placed in state and the 'cdr being placed in response. So we expect fun to return something like (cons state response)

          (==> pid (list 'response tag response))

Note the use of 'tag here. We expect that 'tag would be a 'gensym'ed symbol, and is used in pattern matching so that the client can receive the message it's looking for.

          (gen-server fun state))

Plain recursion.

      ; hot code swapping!!
      ('upgrade new-fun)
        (gen-server new-fun state)

This is it, really: just give a new function.

      ('stop)
        t))

-----

2 points by cchooper 6139 days ago | link

Functions (that aren't closures) can be safely cached because they're immutable. If we assume arc.arc is part of the 'spec' (and hence itself immutable) then we can safely link each process to the same functions, but give each one it's own global bindings, maybe?

-----

2 points by almkglor 6138 days ago | link

> arc.arc is part of the 'spec' (and hence itself immutable)

But ac.scm itself is not immutable. cref lib/scanner.arc , which redefines 'car and 'cdr (which are in ac.scm). If ac.scm, which is even more basic than arc.arc, is itself not immutable, then why should arc.arc be immutable?

So no.

In arc2c functions are represented by closures. Pointers to closures are effectively handles to the actual function code.

Now the function code is immutable (that's how arc2c does it - after all, all the code has to be written in C). When a function is redefined, we create a new closure, which contains a pointer to the new function code (which was already compiled and thus immutable), then assign that closure to the global variable.

Basically my idea for a cache would also have an incremented update pointer:

  class SymbolAtom : public Atom {
  private:
          std::string name;
          Generic* value;
          size_t version;
  public:
          friend class Process;
  };

  class Process : public Heap {
  /*blah blah heap stuff...*/
  private:
          std::map<Atom*, pair<size_t, Generic*> > g_cache;
  public:
          Generic* get_global(Atom* a){
                  std::map<Atom*, pair<size_t, Generic*> >::iterator i;
                  i = g_cache.find(a);
                  // not in cache
                  if(i == g_cache.end()){
                          Generic* mycopy = a->value->clone(*this);
                          g_cache[a] = pair<size_t, Generic*>(a->version, mycopy);
                          return a->value;
                  } else {
                          pair<Atom*, pair<size_t, Generic*> > ip = *i;
                          pair<size_t, Generic*> ipp = ip.second();
                          // no change, return value
                          if(a->version == ipp.first()){
                                  return ipp.second();
                          } else {
                                  //recache
                                  Generic* mycopy = a->value->clone(*this);
                                  g_cache[a] = pair<size_t,Generic*>(a->version, mycopy);
                                  return mycopy;
                          }
                  }
          }
  }

-----

2 points by shader 6139 days ago | link

Agreed. Depending too much on global state sounds risky, though I suppose it depends on how you abstract it. You could have two types of 'globals'. One being the traditional scopeless variable, local to each process, and the other a process that is registered somewhere that makes it easy to find. So, instead of requiring pre-knowledge of the pid, you can find it easily. That sort of sounds like a global variable, but it wouldn't be limited to just storing a value.

Unfortunately, you'd still need to handle the problems that would occur if one of those global procs died, or if the table got corrupted.

-----

1 point by shader 6132 days ago | link

Are there any plans in arc2c or SNAP to reify pointers and direct memory addressing in arc? I realize that isn't very "lispy" but it would be way cool to be able to do low level byte mashing/memory management in an otherwise high level language.

pointers + macros = really cool or really dangerous.

Maybe we should go for something a bit like erlang's bit syntax instead.

-----

1 point by almkglor 6132 days ago | link

Err, no. Hehe. Although I believe some Lisps actually implemented their memory managers (GC!) in Lisp itself ^^, but I think it was back when Lisp was more imperative (i.e. when (prog ...) with labels was more popular).

Hmm. Interesting idea, although I can't quite imagine how to do this.

-----

1 point by shader 6120 days ago | link

"interesting" doesn't convey much information. Does it mean positive interesting, as in you'd be interested in trying it? Or does it mean negative interesting, as in your interested in why I would want such a crazy feature in the first place? :)

-----

1 point by almkglor 6120 days ago | link

> Does it mean positive interesting, as in you'd be interested in trying it?

This one. If it were the negative interesting, I'd ask you directly "what for?" ^^

Hmm. Although it might be difficult to do with a copying GC.

As an aside, I'm thinking of exposing the "sharedvar" objects into arc, basically they would be called the "container" type. An arc-on-mzscheme implementation would be:

  (require "lib/settable-fn.arc")
  (def container ((o val))
    (add-attachments
      '= (fn (v) (= val v))
      (annotate 'container
        (fn () val))))

Basically it would be used this way:

  (= foo (container 0))
  (foo)
  => 0
  (= (foo) 42)
  (foo)
  42

Basically assigning to it would be equivalent to setting the pointer, and (foo) would be equivalent to (* foo) in C.

I thought it might be useful in a C FFI (when I get there in, say, T=inf). If a C function accepts a int*, for example, where it would put an int, you would pass in a container, which would automagically get the int.

-----

1 point by shader 6120 days ago | link

Sounds like a reasonable syntax. Would the value in a container in a container be accessed with ((foo))? It fits, but it might get old after a while.

Also, how similar would these "containers" be to pointers? It doesn't seem like it would be as useful, because I doubt you can increment/decrement a container.

So, what are the problems we might encounter if we actually use this?

-----

1 point by almkglor 6120 days ago | link

The value pointed to by the container would be accessed with (foo), just a single layer of parens. The container's pointer would be mutated as (= (foo) ...) though. Oh well.

Edit: Oh, you said "container in a container". Yes, ((foo)) it is.

> It doesn't seem like it would be as useful, because I doubt you can increment/decrement a container

This functionality would be in a scanner, although a scanner, technically, would be just an iterator.

For instance, I fully intend to implement string scanners and input file scanners on the C++-side in SNAP; they would be somewhat "safe" wrappers around C++ iterators, which are just a generalization on the pointer. so (car foo) would be *foo and (cdr foo) would be (foo + 1).

> what are the problems we might encounter if we actually use this?

None at all, if you're talking about just containers and scanners. A container is just a reference to an Arc object. A scanner would just be an Arc-safe wrapper around an iterator.

-----

1 point by almkglor 6070 days ago | link

http://okmij.org/ftp/Scheme/pointer-as-closure.txt

-----

1 point by applepie 6139 days ago | link

I wonder if the aim of the vm is to be faster than the standard Arc implementation, or just to have a means of reifying the continuation and thus be able to stop a process to do process management.

I would go for simplicity and copy always. Unless Arc had some kind of inmutable data structure and the vm were able to copy just the mutable parts.

-----

3 points by almkglor 6138 days ago | link

> I wonder if the aim of the vm is to be faster than the standard Arc implementation, or just to have a means of reifying the continuation and thus be able to stop a process to do process management.

More of a way of taking advantage of multiple processors/cores while not having to do a lot of process management. Of course, there's not much point in using multiple processors anyway if the VM itself is slow ^^

> I would go for simplicity and copy always. Unless Arc had some kind of inmutable data structure and the vm were able to copy just the mutable parts.

I agree, I did have a plan for a sometimes-sharing, copy-on-write-and-if-receiver-has-a-shared-copy VM, but it felt more complex than necessary.

It would probably be useful to also have binary blobs like in Erlang.

-----

1 point by binx 6139 days ago | link

2. Consider this: when a process is created, a private heap is allocated for it and the "cache" of global variables is copied to the heap. When each process mutates any structures, check whether the structures are in its own heap. If not, we can raise an error.

-----

1 point by almkglor 6139 days ago | link

Well, first: does reading a global variable at some point create a copy or not?

Having to do the checking on each write has the slight disadvantage of slowing down writes. For that matter, we still have to consider how heaps are organized.

For example, in my current stop-and-copy, always-copy-on-message-send, each heap has several semispaces. Semispaces conceptually belong to only one heap. One of the semispaces is the current semispace, the one that is being allocated into. The others were semispaces that were received during message sending.

When a message is sent, it is copied into its own semispace, then the semispace is added to the list of "other semispaces" held by the heap.

During a GC, the GC copies all semispaces into a single new semispace, then the heap releases all the other semispaces and sets the new semispace as the current semispace.

In theory it should be possible to reduce slightly the GC overhead by just copying some (but not all) semispaces into the newest semispace: this is basically an incremental GC

Edit: As an aside, the advantage of leaving it "undefined" is so that we can change the implementation of global variables without committing ourselves to stone. "What, you kept a table in a global variable and expected to use that to communicate with other processes? Well, sorry, but that won't necessarily work (or not work)"

-----