Yeah, this is the approach taken by chez, mlton, and many recent compilers of functional languages.
And all the optimizations (unboxing of "sharedvar", inlining, type inference, known function analysis, unused variable elimination, etc.) can be made aggressive only with global flow analysis, which is rather time-consuming. But I'm curious how far a relatively conservative compiler that doesn't do any flow analysis can go. If Stalin performed a little worse but compiled 10x faster, I believe many more people would use it.
Sharedvar unboxing is a little difficult, since there are two types of local variables: those in closures, and those in parameters:
(let kept ()
(set kept (cons x kept)))))
(set x (rev x))
So basically we need two types of local-variable-set primitives: one for closures-variable-set (first case above) and another for parameter-variable-set (second case)
Re: Stalin - is it that slow? Meaning an order of magnitude improvement of time is needed to make it comfortable?
Type inference: well, I can't think of a good way of getting type inference generically, but it's certainly possible for e.g. '+. '+ requires that all parameters are numbers, or all strings, or all lists, so if we can determine that one parameter is of a specific type, we can insert a check that the other parameters are of that type and immediately bind the + to the type-specific primitive.
For example if we have %n+ for numeric addition, %s-join for string concatenation, and %l-join for list concats:
(+ x y z)
(+ x y z) ; can't determine type
(+ 1 x)
(%n+ 1 (let check x
         (if (is (type check) 'num)
             check
             (err "+: type mismatch"))))
(+ (list 1 2 3) x)
(%l-join (list 1 2 3)
         (let check x
           (if (is (type check) 'cons)
               check
               (err "+: type mismatch"))))
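To make the saving concrete, here is a minimal C sketch of the same idea. Everything in it is hypothetical (the obj layout, is_num, prim_num_add standing in for %n+, and the type_tests counter); it is not arc2c's actual representation. The generic add must test both operands, while the version specialized for a numeric literal tests only the unknown one. For brevity the generic version handles only the numeric case; a fuller one would also dispatch to string and list joins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tagged value; only numbers and strings, for brevity. */
typedef enum { T_NUM, T_STR } obj_tag;

typedef struct {
    obj_tag tag;
    double num;        /* valid when tag == T_NUM */
    const char *str;   /* valid when tag == T_STR */
} obj;

obj make_num(double n) { obj o = { T_NUM, n, NULL }; return o; }
obj make_str(const char *s) { obj o = { T_STR, 0.0, s }; return o; }

/* Counts run-time type tests so the saving is visible. */
int type_tests = 0;

int is_num(obj o) { type_tests++; return o.tag == T_NUM; }

/* Stand-in for %n+: raw numeric addition with no checks. */
double prim_num_add(double a, double b) { return a + b; }

/* Generic (+ a b): nothing is known, so both operands are tested.
   Sets *err to 1 on a type mismatch. */
double generic_add(obj a, obj b, int *err) {
    if (is_num(a) && is_num(b)) {
        *err = 0;
        return prim_num_add(a.num, b.num);
    }
    *err = 1;
    return 0.0;
}

/* Specialized (+ 1 x): the literal is known to be a number at
   compile time, so only x needs a run-time test. */
double add_const_num(double k, obj x, int *err) {
    if (is_num(x)) {
        *err = 0;
        return prim_num_add(k, x.num);
    }
    *err = 1;
    return 0.0;
}
```

The specialized call does one type test instead of two, and the mismatch error is still raised when x turns out not to be a number.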
Stalin might be the most optimizing but slowest functional language compiler ever written.
Sharedvar unboxing is not an important issue because it doesn't make much difference in efficiency. Most Scheme programs don't update local variables very often. The most useful optimizations are (in my opinion): special treatment of let and letrec, inlining, and known function detection.
General ML-style type inference for Scheme is impossible. What we can do is infer as many types as possible.
1) Transforming let and letrec to ((fn (...) ...) ...) is not efficient. First, it would allocate a closure. Second, it would perform a function call. Instead, the variables bound by let and letrec should be allocated on the stack, with no function call needed.
2) For example, in:
If f is statically known, and f's environment is null or is the same as the environment of the calling site, then the function call should be a direct jump. This eliminates the cost of (1) fetching the global 'f, (2) extracting the address and environment, (3) switching the environment, and (4) an indirect jump.
With known function detection, tail-recursive functions can be compiled to exactly the same code as loops in imperative languages.
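As a rough illustration of the known-function point, here is a small C sketch. The source function sum-to and every C name below are made up for this example. The first version calls through a rebindable global cell, as a call to an unknown f would; the second is what a compiler could emit once it knows the callee is the function itself: the tail call becomes argument reassignment plus a direct jump.

```c
#include <assert.h>

/* Hypothetical source function, in Arc-like notation:
     (def sum-to (n acc)
       (if (is n 0) acc (sum-to (- n 1) (+ acc n))))          */

typedef long (*fn2)(long, long);

/* Global cell holding the function; assumed rebindable at run time. */
fn2 global_sum_to;

/* Unknown callee: fetch the global, then make an indirect call. */
long sum_to_indirect(long n, long acc) {
    if (n == 0) return acc;
    return global_sum_to(n - 1, acc + n);
}

/* Known callee: the self tail call is a plain jump, i.e. a loop. */
long sum_to_direct(long n, long acc) {
entry:
    if (n == 0) return acc;
    acc = acc + n;   /* compute the new acc before n changes */
    n = n - 1;
    goto entry;
}
```

A C compiler sees the second version as an ordinary loop, so its usual loop optimizations apply; the first version depends on whatever the global cell holds at run time.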
3) At the start of each function (with the exception of function 0, which is the top-level), the stack contains [0] a closure for the current function, [1] the continuation, and [2+] the Arc parameters. This is assured by the calling function.
4) Functions are passed around as closure structures. The first element of the closure structure is a non-encoded number, representing the case for that function, while the rest is simply an array of closure variables.
5) Functions simply use the stack for temporary scratch space. For example, this is what (+ 1 2) would compile to:
6) Just prior to calling a function, the calling function pushes the parameters in order: [0] closure (the function to call), [1] continuation, [2+] arguments. The number of elements N for the call is computed by the compiler.
7) Then at the function call, the calling function copies the top N elements of the stack into the bottommost N elements, and ensures that sp = &stack[N]. Then it sets the C variable pc to the closure's function field, and does a C goto jump.
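Steps 3) to 7) above can be sketched as a tiny C program. This is only a model under stated assumptions, not actual arc2c output: the names run, prepare_call, PUSH, add1_closure and halt_closure are hypothetical, values are stored as raw integers with no tagging, and a continuation is assumed to receive its own closure in slot 0 and the result in slot 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef intptr_t obj;            /* holds a raw integer or a pointer */

#define STACK_SIZE 64
obj stack[STACK_SIZE];
obj *sp = stack;                 /* points at the next free slot */

#define PUSH(x) (assert(sp < stack + STACK_SIZE), *sp++ = (obj)(x))

/* Closures: element 0 is the raw case number; closure variables
   would follow (none are needed in this example). */
obj add1_closure[] = { 1 };
obj halt_closure[] = { 2 };

/* Step 7: copy the top n elements to the bottom of the stack, set
   sp = &stack[n], and return the case number of the callee. */
int prepare_call(int n) {
    obj *callee;
    memmove(stack, sp - n, (size_t)n * sizeof(obj));
    sp = stack + n;
    callee = (obj *)stack[0];
    return (int)callee[0];
}

obj run(obj arg) {
    int pc = 0;
    sp = stack;
jump:
    switch (pc) {
    case 0:                      /* top level: call (add1 arg) */
        PUSH(add1_closure);      /* the function to call */
        PUSH(halt_closure);      /* the continuation */
        PUSH(arg);               /* the argument */
        pc = prepare_call(3);
        goto jump;
    case 1:                      /* add1: [0] closure, [1] k, [2] x */
        PUSH(stack[1]);          /* callee: the continuation k */
        PUSH(stack[2]);          /* scratch: x */
        PUSH(1);                 /* scratch: 1 */
        sp[-2] = sp[-2] + sp[-1];/* replace both scratch slots ... */
        sp--;                    /* ... with their sum */
        pc = prepare_call(2);
        goto jump;
    case 2:                      /* halt: [0] closure, [1] result */
        return stack[1];
    default:
        assert(0 && "unknown case");
        return 0;
    }
}
```

Calling run(41) walks through cases 0, 1 and 2 and returns 42; after each call the stack holds exactly the N elements the callee expects, starting at the bottom.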
Well, I only have experience writing direct-style compilers, not CPS-style ones, so my advice needs to be adapted.
But from the mechanism of the current arc2c output you showed above, I see several places for improvement:
1) In a function:
(fn (x y z ...) (g A B C D ...)),
if B doesn't rely on x, C doesn't rely on x and y, D doesn't rely on x, y and z...etc, the calling function could avoid copying elements to the bottom. Instead, it moves the stack pointer to the bottom first, and then pushes the arguments.
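2) For functions having no environments, we don't have to push a full closure, we just have to push pc.
3) For known functions, we can do a C goto directly to that function's (case n) instead of to the dispatching jump label, because C cases are in fact labels. (In practice this means putting a named label right at the case, since a standard C goto cannot target a case label itself.)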
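Point 3) can be sketched in C as follows. The function run_sum, the label fn_sum and the dispatches counter are all made up for this illustration. A named label sits right beside the case, so a call to a known function jumps straight to its body, while an unknown call sets pc and goes back through the switch.

```c
#include <assert.h>

/* Counts how many times control went through the switch. */
int dispatches = 0;

/* Sums n + (n-1) + ... + 1 with a self-calling "function" at case 1. */
long run_sum(long n, int use_direct_jump) {
    int pc = 1;
    long acc = 0;
    dispatches = 0;
jump:
    dispatches++;
    switch (pc) {
    case 1:
    fn_sum:                      /* named label beside the case */
        if (n == 0) { pc = 2; goto jump; }
        acc += n;
        n -= 1;
        if (use_direct_jump)
            goto fn_sum;         /* known function: skip the dispatch */
        pc = 1;                  /* unknown function: set pc ... */
        goto jump;               /* ... and re-dispatch */
    case 2:
        return acc;
    }
    return -1;                   /* not reached */
}
```

For n = 10 the dispatching version goes through the switch 12 times, the direct-jump version only twice, and both return 55.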
Finally, in my opinion, a CPS-style compiler is no longer a better choice nowadays. It complicates the source, the debugging information and the (human) analysis of the program structure. Since we are already using a separate stack that is different to C's, continuations can be implemented in direct-style compilers as easily as in CPS-style ones. And codegen for direct-style compilers is just slightly more difficult, which isn't an issue. In addition, a naive direct-style compiler performs much better than a naive CPS-style one. The latter needs a source simplifying step to eliminate unnecessary closures and function calls produced by CPS conversion.
1) personally I think this is a rare case, but I could be wrong
2) arc2c closures are very lightweight: it's just a simple array of obj(s), with the first obj being the pc. So in effect for functions having no environment, we are pushing a pointer to the pc.
That said, closures are also used to represent functions that can be passed around. Unfortunately closures are currently untyped, so we expect the current closure style to be changed.
Also we need to support the possibility that a "function" being called isn't really a function: after all table syntax is just (tb key). And this is perfectly valid Arc:
(let sometable (table)
(each k lst
(= sometable.k (generate-something k)))
(map sometable ; yes, we're passing a table as if it were a function!
3) I was actually thinking of this too, although I haven't gotten around to it.
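The table-in-function-position case from point 2) could be handled by dispatching on a type tag at every generic call. The sketch below is only an assumption about how that might look; callable, apply1, map1 and the tiny int-to-int table are hypothetical and not arc2c's types. A missing key returns -1 here as a stand-in for nil.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_FN, T_TABLE } obj_tag;

typedef int (*fn1)(int);

/* Tiny fixed-size table from int keys to int values. */
typedef struct {
    int keys[8];
    int vals[8];
    int len;
} table;

/* Anything that may appear in function position. */
typedef struct {
    obj_tag tag;
    union { fn1 fn; table *tb; } as;
} callable;

/* Linear lookup; returns `missing` when the key is absent. */
int table_ref(const table *t, int key, int missing) {
    for (int i = 0; i < t->len; i++)
        if (t->keys[i] == key) return t->vals[i];
    return missing;
}

/* Generic application: (c arg) works for functions and tables. */
int apply1(callable c, int arg) {
    switch (c.tag) {
    case T_FN:    return c.as.fn(arg);
    case T_TABLE: return table_ref(c.as.tb, arg, -1);
    }
    return -1;
}

/* map over an array; the caller need not know what c really is. */
void map1(callable c, const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) out[i] = apply1(c, in[i]);
}

int twice(int x) { return 2 * x; }
```

With this shape, passing a table to map1 behaves like passing a lookup function, at the cost of one tag test per call that a known-function optimization could remove.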
re: CPS: I wouldn't really know. Me, I'm just hacking around at the transformations before the CPS and Closure conversions. Because of the somewhat modular construction of arc2c, in theory you could write a drop-in replacement for CPS and Closure conversions, as well as code generator, and we can then put either CPS or the direct style as options, maybe.
A non-optimizing compiler leads easily to a "fast enough" executable. Without optimizations I think the compiled code would be 7x~10x slower than C.
Edit: I've tried the Fibonacci "benchmark" on a simple compiler I'm writing: it takes 0.2 seconds to compile the program and compute the 32nd Fibonacci number. On the current Arc interpreter it takes ~5 seconds.
Your compiler might be much slower once it supports true Scheme numbers, the + operator as a function (not a primitive operator), and stack overflow checking. These features are currently supported by the Arc interpreter on mzscheme.
If you can correctly eliminate function calls on +, your compiler is an optimizing one, not non-optimizing...
I've tried the same example putting a function call and a test around every arithmetic operation, and execution time went from ~0.2s to ~0.26s, not a big difference, although a few optimizations will probably be necessary for something more complex than the Fibonacci example.
Is the function call overhead so small? I didn't realize.^^
But there are other issues: the fib example is not a very good benchmark, because in C, general recursion is not a common paradigm. If we compare C loops to Arc tail-recursive calls generated by a simple compiler instead of comparing C recursion to Arc recursion, I believe that the difference will be much larger. Because C compiler writers have spent at least 20 years optimizing loops...
That's absolutely true. Reaching C speed with high-level languages such as Lisp is very, very difficult. CMUCL and SBCL reach roughly the speed of C, but they've been developed for a long time.
As for loop speed vs. tail-recursion speed, the difference shouldn't be too big.
Stalin performs as well as C in numerical programs and many other benchmarks. The most exciting thing is that unlike CL, Stalin doesn't need type declarations to guide optimizations. It infers as much type information as possible. The problem is that it compiles too slowly and is no longer maintained.
Naively implemented tail recursion is still not fast, because many common loop optimizations can't be applied directly unless you eliminate the function calls and treat them as true gotos. It's a rough task, because global flow analysis is needed to eliminate as many calls as we can.