> Btw, docstrings are a real performance killer: they are useless, but they are allocated on the heap and fill it up really quickly (ah, recursive functions...).
Are you saying that you alloc a docstring at every function call?
Well, for the moment, yes. Every object appearing in the program has to be allocated (it's not an optimizing compiler yet). Useless objects are not detected, so every time the compiler sees a string, it generates code to allocate it, and it is freed on the next GC cycle. Every time you call the function, that code is executed. Well, that's an easy optimisation, so I'll work on it very soon I guess.
Yes, it's not difficult. You just have to find all constant values, create a global var for each constant, assign the constant value to the global var once, and substitute each occurrence of the constant with the global var name.
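A minimal sketch of the difference in the generated C, assuming hypothetical names (obj, alloc_string, init_constants are made up for illustration, not the actual arc2c runtime):

typedef void *obj;
obj alloc_string(const char *s);   /* hypothetical runtime allocator */

/* Before: a fresh string is heap-allocated on every call. */
obj f_before(void) {
  return alloc_string("a docstring");
}

/* After: the constant is allocated once, at startup, into a global slot,
   and the function body only references that slot. */
static obj QUOTED_0;

void init_constants(void) {
  QUOTED_0 = alloc_string("a docstring");
}

obj f_after(void) {
  return QUOTED_0;
}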
Const lists (created with quote) are improper Scheme lists, i.e. they end with nil instead of the empty list (in Scheme nil != '()), whereas lists created with (list ...) are proper lists (ending with '()). This is certainly a bug.
I do have one question about loading libc. Since the system doesn't automatically find libc.so for you, you need to name the path explicitly. What worked for me was
The underlying foreign implementation is looking for the library. The problem is that /usr/lib/libc.so is not the actual shared library (it is a linker script), and dlopen() does not know how to handle this.
A better alternative with mzscheme's foreign interface is to use `#f' for the library -- that treats the mzscheme binary as the library to open (which includes the usual libc stuff).
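For what it's worth, the same trick exists at the C level: dlopen() accepts NULL to mean "the running executable plus the libraries it was linked against", which is roughly what passing #f does through mzscheme's interface. A small, non-mzscheme illustration (link with -ldl):

#include <stdio.h>
#include <dlfcn.h>

int main(void) {
  /* NULL = the main program and its already-loaded dependencies (libc included) */
  void *self = dlopen(NULL, RTLD_LAZY);
  if (!self) { fprintf(stderr, "%s\n", dlerror()); return 1; }

  void *f = dlsym(self, "malloc");   /* resolved from the libc already in the process */
  printf("malloc is at %p\n", f);
  dlclose(self);
  return 0;
}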
Also, I think I need to use wait() to wait for a child process to terminate, but I am somewhat at a loss as to how to accomplish this, since the manual says it takes an int pointer... Again, any help would be appreciated.
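In plain C, the int pointer is just an out-parameter that receives the child's exit status; a minimal sketch (waitpid() on the specific child, then decode the status with the W* macros). How you pass the pointer through mzscheme's foreign interface is a separate question, so treat this only as the C-level picture:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid == 0) {                     /* child */
    execlp("ls", "ls", (char *) NULL);
    _exit(127);                       /* only reached if exec failed */
  }
  int status;                         /* wait()/waitpid() fill this in */
  if (waitpid(pid, &status, 0) < 0) { perror("waitpid"); return 1; }
  if (WIFEXITED(status))
    printf("child exited with %d\n", WEXITSTATUS(status));
  return 0;
}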
Also, in your malloc() example, don't you have to explicitly free the memory afterwards? MzScheme's GC doesn't deal with memory explicitly allocated from C, does it? (I know the example itself uses only a tiny amount of memory, but still...)
Yes and no. I think mzscheme's GC can handle it without your having to call free at the right time, but you have to register a "finalizer" function to do so. That function will be called when the GC collects the object.
It does not know the size of malloc'ed objects, however, so be careful (no problem here, but if you allocate something very big, mzscheme will only see a new reference and will not necessarily trigger a GC when you actually run out of memory).
1. Hash tables give you O(1) access with smaller constant factors than red-black trees, which could need a few rotations after every insert/remove.
2. It's better to use cons cells for list construction rather than vectors, because a vector has an extra cell to store its own length. On 32-bit machines a cons cell takes 8 bytes, while a vector of 2 elements takes 12 bytes, i.e. 50% more than the cons cell (see the sketch below).
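Roughly the layouts being compared, as a sketch assuming 32-bit pointers (illustrative structs, not the actual arc2c ones):

typedef void *obj;

typedef struct {        /* cons cell: 2 words = 8 bytes on a 32-bit machine */
  obj car;
  obj cdr;
} cons_cell;

typedef struct {        /* 2-element vector: 3 words = 12 bytes */
  long length;          /* the extra cell that stores the vector's length */
  obj data[2];
} vec2;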
2) In theory, the length of the vector could be stored by the memory manager anyway (possibly as a start and end address), which after all has to know how large a block to deallocate when the vector is GC'ed. In fact, if the metadata is interspersed with the allocated memory, it will probably just be a length, since the current address is already known.
Macros should be easier to implement once the compiler is able to compile itself, because this way the compiler and the compiled macro have the same internal representation of data structures, so passing arguments between the two shouldn't be too hard.
There are several uses of macros in the compiler, unfortunately. In particular the 'def macro is too much of a convenience. So in order for the compiler to easily compile itself, it first has to implement macros. Chicken, meet egg.
Ah heck, maybe I should just use 'eval now and implement a compiled 'eval interpreter later that can interpret code and yet allow interpreted code to call compiled code and vice versa.
In fact I already have a bit of a sketch for this (which is necessary if we want to allow compiled programs to use 'eval). Basically, put interpreted '(fn ...) forms into an 'interpreted-fn annotated type together with the surrounding environment, and add an entry to the 'calls* table (via defcall, say) for 'interpreted-fn to, say, a $$interpreted-fn-apply function which binds the parameters into an environment table and calls the 'eval interpreter.
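Purely as a sketch of the shape this could take on the C side (every name below is invented for illustration; the real calls* / defcall machinery may look quite different):

typedef void *obj;

/* An interpreted function: its (fn ...) form as a list, plus the
   environment it was created in, behind its own type tag. */
typedef struct {
  long type;            /* hypothetical T_INTERPRETED_FN */
  obj code;             /* the '(fn (params ...) body ...) list */
  obj env;              /* enclosing interpreted environment */
} interpreted_fn;

/* assumed helpers provided by the interpreter, not real arc2c functions */
obj bind_params(obj code, obj args, obj env);
obj eval_body(obj code, obj env);

/* what the calls* entry for 'interpreted-fn would dispatch to */
obj interpreted_fn_apply(interpreted_fn *f, obj args) {
  obj env = bind_params(f->code, args, f->env);
  return eval_body(f->code, env);
}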
Of course this requires some changes in the base system: we need at the very least a %symeval primitive which when given a symbol will give its global binding, a %symset primitive which will modify a symbol's global binding, and obviously we need a link from the symbol to the GLOBAL() array (and dynamically create new containers for created symbols - if it's not in the GLOBAL() array then the compiled code would never read that global anyway, only the interpreted code ever will).
The rest of the interpreter is just a standard scheme interpreter, the only real support we need is to be able to call compiled-from-interpreted and interpreted-from-compiled, and the reading and binding of global symbols, including those that aren't in the GLOBAL table.
Ouch, my head hurts. And sacado's the one doing the Unicode strings. LOL
Would it be worth implementing 'def directly? This would give a lot more functionality right away. This could be temporary until macros are implemented.
Possibly. There's a bunch of "macro" transformations in xe.arc; maybe I'm just a bit too lazy to think. However, I don't like depending on those transforms; I want to do it "properly".
For the global vars problem, a solution could be to associate top level values directly with the symbol, this way a symbol would consist of three values: its string representation, its global value (initially a special 'unbound value') and a property list.
The current style has an optimization where all globals are simply referenced directly from an array in O(1). I'd rather that symbols point to entries in this array, because symbol-as-global-variable lookups are expected to be completely nonexistent if 'eval isn't involved in the program anyway (who uses 'eval in a language with 'read?). Only newly created symbols must have allocated variable values, and only for the benefit of 'eval'ed code - we can already know the global variables in the compiled code, because the compiler needs that info anyway.
Basically:
typedef struct {
  long type;            /* T_SYM */
  char *stringform;
#ifdef EVAL_USED
  obj *binding;
#endif
} symbol;

int main() {
  /* compiler-generated only if eval is used */
  obj sym;
  symbol *sympt;
  sym = SYM2OBJ("globalvar0");
  sympt = (symbol *) sym;
  sympt->binding = &GLOBAL(0);
  sym = SYM2OBJ("globalvar1");
  sympt = (symbol *) sym;
  sympt->binding = &GLOBAL(1);
  ...
}
This way the current performance is retained (global variable lookups are O(1)).
I don't know how well this solution will hold up once support for dynamic load (e.g. from the REPL) has to be implemented, because you'll have to keep an index of the last global variable created across different compilation sessions. With threads it gets even more complicated (a mutex on the index?). With symbols it would be simpler to implement dynamic load or definition of a global var from the REPL. The price paid is slightly slower access to global variables, because 2 memory references are needed for every reference to a global var. Global variable lookups are still O(1) though, e.g. sym->binding for read access and sym->binding = value for write access.
> the last global variables created across different compilation sessions
I don't understand this part. I was proposing that 'eval would be an interpreter, not a compiler. My intention was that compiled code would be statically generated (the way it's done now), so 'eval cannot possibly compile code. It would be a compiled interpreter of Arc. arc2c is a static compiler, so 'eval won't ever add compiled code; the best it can do is create an 'interpreted-fn object that contains an interpreted function's code (as a list) and the enclosing interpreted environment.
So a dynamic load would just interpret the expressions in the file being loaded:
(set load
  (fn (f)
    (w/infile s f
      (whilet e (read s)
        (eval e)))))
'eval would be able to access the global variable table indirectly via the symbols and %symeval/%symset.
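Given the symbol struct sketched above, the two primitives are essentially one extra indirection each; the names and exact representation here are only an illustration:

obj symeval(obj s) {                    /* %symeval: read the global binding */
  return *((symbol *) s)->binding;
}

void symset(obj s, obj value) {         /* %symset: write the global binding */
  *((symbol *) s)->binding = value;
}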
Basically, 'eval would be compiled to something like this:
(set eval
  (fn (e (o env nil))
    (if (isa e 'symbol)
        (if env
            (lookup-environment env e)
            (%symeval e))
        (...))))
Also: if the compiled code doesn't reference it, it won't be in the GLOBAL() array. The reason is simple: the compiled code won't reference it, ever. If 'globalvar isn't in GLOBAL(), then it does not exist in the compiled code. So it doesn't matter that it's not in the GLOBAL() array - the compiled code never referenced that global, so it won't ever use an index into the GLOBAL() array to refer to it. The interpreted code might, but that's why we have an indirect reference connected to the symeval.
Also, when I say O(1), I mean O(1) with the number one, as in only one layer of indirection (an index within a table). If global bindings are kept with the symbol only, then all global accesses - even precompiled ones - need (1) to find the symbol and (2) get the binding, for a total of O(2).
In other words: 'compile-file compiles, but it creates a new executable which is never connected to the process that ran 'compile-file. 'eval just interprets, and if the interpreted code mutates a global of the program, then that global gets mutated for real, in the program (what are you doing using 'eval on untrusted code anyway?). But if the interpreted code mutates a global that is never used in the program, it just creates a new global variable, one which is never referenced by the program (by definition, because the program never used it).
I thought eval compiled code, loaded it and then executed it. I was mistaken. With the compiled code completely static, your strategy is better than assigning values to symbols.
I'd like to have access too. Currently I don't have much free time, but if I find some I'll be glad to contribute. My github username is, surprisingly, 'stefano'.
A non-optimizing compiler leads easily to a "fast enough" executable. Without optimizations I think the compiled code would be 7x~10x slower than C.
Edit: I've tried the Fibonacci "benchmark" on a simple compiler I'm writing: it takes 0.2 seconds to compile the program and compute the 32nd Fibonacci number. On the current Arc interpreter it takes ~5 seconds.
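(For reference, the benchmark is presumably the usual doubly-recursive definition, which in C would be:)

int fib(int n) {
  if (n < 2) return n;
  return fib(n - 1) + fib(n - 2);
}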
Your compiler might be much slower if it had true Scheme numbers, the + operator as a function (not a primitive operator), and stack overflow checking. These features are currently supported by the Arc interpreter on mzscheme.
If you can correctly eliminate function calls on +, your compiler is an optimizing one, not non-optimizing...
I've tried the same example putting a function call and a test around every arithmetic operation, and execution time went from ~0.2s to ~0.26s, not a big difference, although a few optimizations will probably be necessary for something more complex than the Fibonacci example.
Is the function call overhead so small? I didn't realize.^^
But there are other issues: the fib example is not a very good benchmark suite, because in C, general recursion is not a common paradigm. If we compare C loops to Arc tail-recursive calls generated by a simple compiler, instead of comparing C recursion to Arc recursion, I believe the difference will be much larger, because C compiler writers have spent at least 20 years optimizing loops...
That's absolutely true. Reaching C speed with a high-level language such as Lisp is very, very difficult. CMUCL and SBCL reach roughly the speed of C, but they've been developed for a long time.
As for loop speed vs. tail recursion speed, the difference shouldn't be too big.
Stalin performs as well as C in numerical programs and many other benchmarks. The most exciting thing is that, unlike CL, Stalin doesn't need type declarations to guide optimizations: it infers as much type information as possible. The problem is that it compiles too slowly and it's not maintained anymore.
Naively implemented tail recursion is still not fast, because many common loop optimizations can't be directly applied unless you eliminate the function calls and treat them as true gotos. That's a rough task, because global flow analysis is needed to eliminate as many calls as we can.
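The kind of transformation meant here, sketched in C: once a self tail call is treated as a true goto, it becomes an ordinary loop that the usual loop optimizations can work on (illustrative only):

/* Tail-recursive form: every step is a real function call. */
long sum_to(long n, long acc) {
  if (n == 0) return acc;
  return sum_to(n - 1, acc + n);        /* self tail call */
}

/* Same function after turning the tail call into a goto:
   no call overhead, and it now looks like a plain loop. */
long sum_to_loop(long n, long acc) {
top:
  if (n == 0) return acc;
  acc = acc + n;                        /* compute the new argument values... */
  n = n - 1;
  goto top;                             /* ...then jump instead of call */
}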
This strategy seems viable, but slow, because there are a lot of checks at every thread switch. If we could manage to transform every Arc function into a C function (actually an array containing the C function's address and the closed-over vars), we could then rely on the pthread library for multi-threading, or on the equivalent threading library available on the host system.
The problem with transforming each Arc function to a C function is the stack space consumed at each call. Sure, gcc does tail call optimization, but not all compilers are gcc.
Chicken fixes this by keeping track of the stack and GC'ing it when it's full; I'm not sure about Bigloo, but I hear it has a pretty good Scheme function == C function equivalency, so it might be useful to see how they solved the tail call optimization problem there.
That said, the current execute() function accepts a pc argument, which specifies which Arc function it begins with. It may be possible to pass the pc to go to together with a new stack, but I don't know pthreads.
I think you could trust every decent C compiler to do tail recursion optimization, but if you want full compatibility then the mapping Arc fun -> C fun doesn't work. pthreads require you to pass a pointer to a C function, i.e. an address to jump to. We could wrap every thread within a C function and leave all the other functions as they are currently implemented. The C function would just call execute() with the right parameters. If execute() doesn't use global vars, there will be no race condition problems.
Well, I've started to move bits of execute() from globals to locals. However, I do still access a few global variables, specifically the quoted-constants array (those created by 'foo and '(a b c d), etc.); this table is initialized at startup (before the first call to execute). I would suppose this read-only array is okay to access?
As for wrapping them in C functions: the problem is that the most basic Arc threading function isn't 'thread, it's 'new-thread, which accepts a function as input. 'thread is defined as:
(mac thread body
  `(new-thread (fn () ,@body)))
In theory, new-thread could be called with any function.
Sure, it won't happen most of the time, since most people will sensibly use the simpler 'thread, but exploratory, exploratory...
It would be possible to implement if pthreads or whatnot can pass even just a single pointer to the newly-threaded function, but if it can't (why not?) then our alternative is to create a bunch of C functions, one per Arc function, each of which just calls the execute() function with the correct number.
---------
Edit: okay, I did a little eensy-weensy bit of research on pthreads, and it seems that pthreads can pass a pointer to the called C function.
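A rough sketch of how that one pointer could carry everything execute() needs; the thread_arg struct and the execute() signature below are assumptions about the runtime, not the actual arc2c code:

#include <pthread.h>
#include <stdlib.h>

/* hypothetical: whatever execute() needs to start a given Arc function */
typedef struct {
  int pc;               /* which compiled Arc function to begin with */
  void *stack;          /* a freshly allocated stack for this thread */
} thread_arg;

extern void execute(int pc, void *stack);   /* assumed signature */

static void *thread_trampoline(void *p) {
  thread_arg *a = p;
  execute(a->pc, a->stack);
  free(a->stack);
  free(a);
  return NULL;
}

/* what the runtime's new-thread primitive could boil down to */
pthread_t spawn_arc_thread(int pc, size_t stack_bytes) {
  thread_arg *a = malloc(sizeof *a);
  a->pc = pc;
  a->stack = malloc(stack_bytes);
  pthread_t t;
  pthread_create(&t, NULL, thread_trampoline, a);
  return t;
}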
This could work. As for global variables, if they are read-only, then there won't be any problem. What happens if you load two different modules (with their constants) in parallel? Maybe loading a file should be made atomic, or at least part of it, such as constant-value initialization.
Using pthreads might have some benefits, but speed is not a reason to choose them. On Linux systems that map pthreads onto kernel threads, I've seen execution two orders of magnitude slower compared to a user-space thread library. In general, I'd say that if I/O is important then using pthreads makes sense, but if speed is important then green threads are preferable.
I thought pthreads were fast, but anyway, on computers with more than one core/processor (which are very common), green threads don't distribute the threads between the cores, because that is the OS's duty, and the OS doesn't know anything about green threads.
Hmm. After a bit of research, it seems that some VM's handle this by using pthreads a few times (presumably one per core), then having each pthread spawn several green threads.
This seems like a good solution if we can arrange to spawn a new pthread also for blocking I/O, maybe by catching I/O functions and running the I/O caller in a pthread. This hybrid would make the compiler code less clean, because it would have to handle both green threads and pthreads.
1) I use a new structure, the sharedvar, which doesn't correspond to any Arc structure. See the other post which had me blinking stupidly at stefano's transformation before I realized how cool it was: http://arclanguage.com/item?id=5784
2) This new structure is untyped (i.e. it doesn't have type tags, unlike the pair and symbol structures). I intend to make shared variables "seamless" to the upper Arc though, so this should be fine...
3) My original name for this structure was closure-settable, which I shortened to sharedvar instead.
Making the new structure untyped shouldn't create problems, but it could make development difficult: if you have a bug in the transformer or in something related to it, you'll probably get a segmentation fault instead of a clean "Not a sharedvar" error. The check could then be removed when the code is released.
Hmm. So far the transformation seems correct anyway; and really, the transform is quite simple and appears mathematically correct. Also, because of the way the compile-file driver is structured, it would be possible to remove/replace each individual transform.
Which reminds me, we need to have a proper error continuation too.
The base idea is pretty good, although the 'cons cell (which is composed of a type id, a car pointer, and a cdr pointer) can be replaced with a smaller "closure-settable" cell, which would be an untyped object composed of a single pointer:
(fn (x)
  ; % means it's a primitive
  (let x (%closure-settable x)
    (list
      (fn () (%closure-settable-read x))
      (fn (v) (%closure-settable-write x v)))))
Because the closure variable abstraction is not accessible from the high-level Arc language, the closure-settable doesn't need to be type tagged.
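As a layout sketch (field and function names invented; this only restates the single-pointer idea above):

typedef void *obj;

/* The whole closure-settable / sharedvar: one untagged word. */
typedef struct {
  obj val;
} sharedvar;

/* %closure-settable-read and %closure-settable-write then compile to
   nothing more than a load and a store through that pointer. */
obj sharedvar_read(sharedvar *s)          { return s->val; }
void sharedvar_write(sharedvar *s, obj v) { s->val = v; }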