Arc Forumnew | comments | leaders | submitlogin
arc-to-c : soon on the git
13 points by sacado 5864 days ago | 65 comments
Hello all, during my long week-end I finally did it : I implemented a crappy version of a compiler from Arc to C (written in Arc of course). Only a very small subset of Arc is implemented (at least, I can compile the fib function, but it crashes after (fib 30) :) and it is still quite slow (a little faster than non-jitted mzscheme however) but I think it is a good starting point. It knows fixnums, (with +, -, x, < and is), conditionals and closures (with lexical scoping and continuations). It is based on Marc Feeley's tutorial on compiling scheme to C. I gave the link a few days ago.

Now, have to be implemented :

- bignums, rationals, inexact numbers and complex numbers (and all their primitives)

- dotted pairs, symbols, strings, hash tables

- special syntax (and the reader to parse it)

- oh, did I mention the GC ?

Once we've got that and a few more primitives, arc.arc should be compilable and everything should be working, but that's only once we've got that. The next step is then the optimizing compiler.

Who's looking for a Google SoC project already ?



3 points by almkglor 5863 days ago | link

Idea: error handling and trapping

1) We could add a "fail" continuation which is called upon error. By default the "fail" continuation will be the same fail continuation as the caller, but for forms such as 'errsafe and 'after, we replace the "fail" continuation.

2) Alternatively, we use a global (or thread-local...) variable to store the current fail handler. Then 'errsafe and friends will store the current value of the fail handler in a local, replace it with its own, and then run its subforms; afterwards (whether it returned via fail or normal continuation) it restores the previous value of the fail handler.

Idea: macros

1) We could piggyback off the existing Arc and just use repeated 'macex until we reduce down to 'fn and 'if and function calls.

1.1) Then for metacompilation, we could create two levels of Arc: an interpreted Arc and a compiled Arc. Macros are done in interpreted Arc while full programs are compiled. Basically interpreted Arc would simply be an 'eval for Arc, in Arc.

Idea: Arc built-in library

1) In the sample we already have a "built-in" which is in fact added to the source to be compiled: ccc or call/cc. We could extend/generalize this to include a set of built-ins that are added to the source to be compiled, maybe like so:

  ; assuming car and cdr are true built-ins
  (def-arc2c-builtin map1 (k f l)
    (if l
        (map1
          (fn (rest-v)
            (f
              (fn (v)
                (cons k v rest-v))
              (%car l)))
          f (%cdr l))
        (k nil))

-----

2 points by eds 5862 days ago | link

> In the sample we already have a "built-in" which is in fact added to the source to be compiled: ccc or call/cc. We could extend/generalize this to include a set of built-ins that are added to the source to be compiled.

I suppose the set of functions to be included would be those in arc.arc and libs.arc, in which case you would just compile arc.arc and libs.arc before compiling the target file, right?

-----

2 points by almkglor 5862 days ago | link

Basically, ccc is handled by detecting if it is used. If it's not used, it's not inserted. When I was talking about extending this, this is what I was referring to: adding code which is defined as "inserted if used".

Simply inserting the entire arc.arc code will add bloat, because most programs don't even use most of the arc.arc code. Given the level of documentation of arc functions (arcfn.com notwithstanding), it is more than likely that the user will not use arc.arc code. So it's better if we simply insert the code if it is used.

Also, it may be better to use arc2c specific code, to take advantage of certain peculiarities in how code is generated. For example, if you decide to use unrolled lists, map1 and friends are better off allocating the full list and then iterating over the values.

-----

1 point by eds 5861 days ago | link

This is fine if the compiler is static like Stalin, but if you omit parts of arc.arc that aren't used, you run the risk of not being able to deal with code known only at run time (e.g. (loop:print:eval:read)).

-----

2 points by almkglor 5861 days ago | link

And if 'eval is an interpreter?

Come now, if you want real support for eval, you'll also need to include the compiler:

  (eval `(fn (x) ,@my-variable)) ;how will it build the function?
The alternative is to punt: if there's ever an 'eval, then add arc.arc completely, make 'eval an interpreter which somehow uses 'symeval to lookup globals (and execute global functions as compiled functions), and when it encounters a function, will build the function as an interpreted function (and obviously allow interpreted code to call compiled functions and vice versa).

Basically you create a virtual function:

  (def eval (e (o env))
    (if
      (caris e 'fn)
        (add-attachment
          'environment env
          (annotate 'virtual-function
            (cdr e)))
      ...))
  (defcall virtual-function (f . args)
    (with (env (get-attachment 'environment)
           (arglist . body) args)
      ; has to be nondestructive
      (zap add-args-to-environment env arglist args)
      (each e body
        (eval e env))))

-----

1 point by sacado 5862 days ago | link

For macros, that what I thought too. Anyway, we'll need eval for completeness. I don't know how it will behave with compiled code, though, and I know it's not an easy problem as stalin scheme simply ignored it.

For the other points, I didn't think about it yet :)

-----

4 points by almkglor 5863 days ago | link

> - special syntax (and the reader to parse it)

Actually, I don't recommend that the reader parse it; pg's code specifically parses special syntax after the reader. This also means that:

  (mac foo (x)
    (string x))
  (foo arf:narf.oops)
  => "arf:narf.oops"
Note that I've actually taken advantage of this fact; my w/html macro, in order to support a better tag class attribute syntax, will process tags of the form div.class#id, which will break if the reader auto-transforms this to (div class#id)

code is spec (grumble) code is spec... (grumble)

-----

1 point by sacado 5863 days ago | link

Hmm... Looks like you're right, again... Well, anyway, we still need a reader and a way to parse the special syntax :)

-----

3 points by almkglor 5863 days ago | link

If you're interested, the Anarki lib/arcformat.arc is actually a start at parsing Arc syntax; we just have to change the filters so that instead of emitting html, we emit Arc objects wrapped in singleton lists. It uses treeparse.

treeparse is pretty good, and I'm sure a full Arc parser can be built on treeparse.

Looks like I need to speed up writing my treeparse tutorial ^^

As for parsing special syntax, an Arc version of ssyntax would work: we just need a symbol->string conversion (meaning full 'coerce support) and seek ~!.: in the substring.

edit: how did you handle continuations and tail call opts?

-----

4 points by sacado 5863 days ago | link

Ah, I'll have a closer look at arcformat then. Might be good to have a reader :)

Continuations & tail-calls are treated the way Steele proposed it in the lambda papers.

The following code :

  (set f (fn (n)
     (if (< n 2)
          n
          (+ (f (- n 1)) (f (- n 2))))))

  (prn (f 30))
is first translated so as to have unique names for variables and so that primitive operations are prefixed with a % (we won't treat them the same way as other operations in the next steps) :

  (do
    (set f (fn (n@1)
      (if (%< n@1 2)
        n@1
        (%+ (f (%- n@1 1)) (f (%- n@1 2))))))

  (%prn (f 30)))
Then we perform continuation-passing-style transformation : each function call is given, as an extra parameter, the portion of code (the continuation) it will send its result to : functions do not return, they call another function with the value they calculated :

  (let ((r@5 (fn (k@6 n@1)
    (let ((r@7 (%< n@1 2)))
      (if r@7
        (k@6 n@1)
        (let ((r@11 (%- n@1 1)))
          (f (fn (r@8)
            (let ((r@10 (%- n@1 2)))
              (f (fn (r@9) (k@6 (%+ r@8 r@9))) r@10))) r@11)))))))

  (let ((r@3 (set f r@5)))
    (f (fn (r@4)
      (let ((r@2 (%prn r@4)))
        (%halt r@2))) 20)))
Finally, we perform closure conversion so as to only perform gotos (i.e., no function calls) :

  (fn nil
    (let ((r@5 (%closure
                       (fn (self@12 k@6 n@1)
                         (let ((r@7 (%< n@1 2)))
                           (if r@7
                             ((%closure-ref k@6 0) k@6 n@1)
                             (let ((r@11 (%- n@1 1)))
                               ((%closure-ref f 0)
                                 f
                                 (%closure
                                   (fn (self@13 r@8)
                                     (let ((r@10 (%- (%closure-ref self@13 2) 2)))   
  ((%closure-ref f 0) f (%closure (fn (self@14 r@9) ((%closure-ref 
  (%closure-ref self@14 1) 0) (%closure-ref self@14 1) (%+ 
  (%closure-ref self@14 2) r@9))) (%closure-ref self@13 1) r@8) 
  r@10))) k@6 n@1) r@11)))))))) (let ((r@3 (set f r@5))) 
  ((%closure-ref f 0) f (%closure (fn (self@15 r@4) (let ((r@2 (%prn 
  r@4))) (%halt r@2)))) 20))))
(Sorry for the indentation, I don't feel like indenting this by hand, but you get the idea I guess ?)

Generating C code from there is then easy. It is almost only Marc feeley's code, based on Steele's ideas a few years ago. I only changed the code so that it understands Arc code and primitives instead of Scheme's...

-----

1 point by almkglor 5863 days ago | link

I assume that for code blocks 3 and 4, 'let is the Lisp 'let, not the Arc one? (let ((var val)) ...) not (let var val ...)?

Also on code block 1 you show (prn (f 30)) but in code block 2 you seem to pass (f <continuation> 20)? I assume this is just a mistake?

I managed to understand the transformation up to code block 3, but not code block 4; what does '%closure do? Allocate heap memory for the closure? Also, what's the syntax for "goto"?

Finally: how about the equivalent C code?

LOL, maybe I should just wait for you to push it on the git and experiment with it myself.

-----

2 points by almkglor 5863 days ago | link

Aruu, okay okay I finally actually looked at the presentation docs, which I probably should have looked at first. One of the things that threw me off was 'self - I was thinking of Arc's sense of 'self as in 'afn!

The output C code looks suspiciously like assembly language to me. Perhaps we can also target a temporary assembly syntax so that we can do minimal peephole opts, such as convert stuff like PUSH(x); y = TOS(); to MOVE(x,y);. Can't wait to actually see this code on the git ^^.

P.S. Given that the transformations are (gasp!) syntactic, it might actually be possible to implement the entire compiler as (gasp gasp!) a treeparse parser (or at the very least a piped chain of treeparse parsers) ^^.

-----

2 points by sacado 5863 days ago | link

Maybe treeparse would be the right thing to use... The code is getting uglier everytime I try to add a new primitive / special form... I dunno...

As for the generated code, yes, it's a lot like a portable assembly code. There are certainly easy optimizations to perform on it, but as for now, it's working and that's a lot :)

And yes, let is the traditional one -- with tons of parens everywhere.

-----

2 points by almkglor 5863 days ago | link

I'll go through the code later to see what can be done. Certainly the AST looks representable as plain lists to me, although I haven't fully analyzed it yet.

As an aside compile-file could be restructured like the following:

  (def compile-file (filename)
    (compile-ast (parse-file filename) (+ (strip-ext filename) ".c")))
  ; to allow programmatic access
  (def compile-ast (ast dest)
    ; chain of conversions
    (let chain
         (list
           (list cps-convert "CPS-CONVERSION")
           (list closure-convert "CLOSURE-CONVERSION"))
    ; do reduction
    (let final-ast
         (reduce
           (fn (ast (f desc))
             (let new-ast (f ast)
               (prn "----------------- AST after " desc)
              (prn (source new-ast))
               new-ast))
           chain ast)
        (prn "-------------------- C Code:")
        (w/outfile f dest
          (w/stdout f
            (prn:liststr:code-generate final-code))))))
This should allow easy insertion of any steps (e.g. optimization steps) along the way.

In fact the chain list should probably be better use `(,), so that we can support flags or suchlike for optimizations:

   `(
       (,cps-convert "CPS-CONVERSION")
       ,@(if optimize
             `((,optimize-after-cps "CPS-OPTIMIZATION")))
       (,closure-convert "CLOSURE-CONVERSION")
       ,@(if optimize
             `((,optimize-after-closure "CLOSURE-OPTIMIZATION"))))

-----

2 points by binx 5861 days ago | link

Just leave the peephole stuff to gcc, it almost always does better than handcoded optimizer.

The CPS transformed code can be arbitrarily inlined, so a simple inliner without flow analysis can give you much efficiency for free.

-----

2 points by almkglor 5861 days ago | link

And if the target isn't gcc?

For that matter my concern is with the expansion of PUSH and POP:

   PUSH(x); y = POP();

   =>

   *sp++ = x; y = *--sp;
Can gcc peephole the above, completely eliminating the assignment to the stack (which is unnecessary in our case after all)?

   y = x; //desired target
Without somehow informing gcc to the contrary, gcc will assume that writing to * sp is significant, even though in our "higher-level" view the assignment to * sp is secondary to transferring the data between two locations.

-----

4 points by sacado 5861 days ago | link

Actually, I tried the above (tuning generated code so as to change something like :

  BEGIN_JUMP(3); PUSH(LOCAL(5)); PUSH(LOCAL(6)); PUSH(LOCAL(7)); END_JUMP(3);
to its semantic but obviously much faster equivalent :

  memcpy (stack, stack + 5, sizeof(obj) * 3); sp = stack + 3; END_JUMP(3);
and

  PUSH(x); if(POP)
to

  if (x)
Well, with full optimizations on gcc (-O3), it doesn't change anything (at least in execution time, I didn't compare generated machine codes). Wow, gcc seems really clever. Now that I know how hard it is to implement a compiler, I can only applaud :)

-----

1 point by almkglor 5861 days ago | link

WOW. gcc must be real good then ^^.

-----

3 points by almkglor 5862 days ago | link

Okay, I'm reviewing ac.scm a bit.

Basically the only thing that can't be redefined on Anarki are:

  quote
  quasiquote
  if
  compile
  fn
  set
  lset
This also means that everything else can be redefined - including +, -, * , /, is, isnt, car, cdr, scar, scdr, sref... Sure, it's not, exactly, "recommended", but exploratory, exploratory...

For instance, scanners redefine car and cdr, and settable-fn's redefine sref.

I propose instead that the compiler detect if the program actually redefines + - * / is isnt etc. If they aren't (which will be 99.9999% of the time) then the compiler can treat them as builtins. Otherwise, the compiler must treat them as globals and insert their definitions into the code (the same way ccc is inserted); then when the program redefines it (probably using (let old car (= car (fn (c) (prn "foo!") (old c)))))), the "builtin" is stored as a function in a global, which the program can refer to and replace.

-----

3 points by sacado 5861 days ago | link

Yep. For the moment, I'm treating them as builtins. They will have to be implemented the way you mention it, that's for sure. I guess the easiest way would be to declare them both ways : as primitives (i.e., + translated to %+ when it is met) and as built-in functions, the way ccc is defined.

-----

3 points by almkglor 5861 days ago | link

Hmm, that's true, you do need to handle stuff like:

  (map + (list 1 2 3) (list 9 8 7))
Haruu, compiling dynamic languages is hard... ^^

For that matter, it shouldn't be so difficult - basically if the compiler ever sees (set + ...) anywhere (where + is a global, at least), it disables the use of %+.

Possibly, we could add another step in the transformation process (which is why I suggested the structure in http://arclanguage.com/item?id=5598 to allow easy insertion/deletion of steps).

Basically the new step just traverses the structure (without mutating it) and creates a list of built-ins to disable.

Edit: or better - it traverses the structure and replaces + with %+, until it reaches a top-level node which has (set + ...), so that code which doesn't use the redefined + will still refer to the builtin %+.

-----

2 points by sacado 5863 days ago | link

It's on the git now. It now includes fixnums and symbols.

To use it, load arc2c.arc, then run (compile-file "foo.arc"). It will then show you all the transformations it performs and generate a file, foo.c. Compile it with your favorite C compiler, and you're done.

Are currently implemented :

- +, -, *, <, >, <=, >=, is, isnt, prn on fixnums - is, isnt, prn on symbols (including t and nil) - let, if, fn, set, and, or, quote.

Basically, that's all. A few warnings : if you call a non-existant function, your program will crash without a warning. And remember there is no GC. Once you run out of pre-defined memory, the program dies.

-----

2 points by binx 5862 days ago | link

GC is not quite a problem, just use boehm's GC and replace all malloc's with GCmalloc's, and delete all free's then it just works.

The main problem is about arc's macros. They're quite different to cl's because they're first-class, and you have to invoke the compiler frequently when running the code. So the target C code should obtain the source information for macro expansion.

-----

4 points by sacado 5861 days ago | link

Hey, by the way, I have a working GC now. It's terribly slow, but it's my first one, so I like it anyway. Yes, I know I should have used Boehm (no time lost to implement a slow, buggy, informally-specified implementation of a GC) but, after all, I don't play implementing compilers every day, so... It will be changed when it will need to (when Google SoC will be over, we'll have a new GC !)

-----

1 point by almkglor 5861 days ago | link

LOL.

Code or it didn't happen ^^. Would like to see it for hacking through too.

-----

1 point by almkglor 5862 days ago | link

Arc's macros are not perfectly first-class: they expand only if they're defined in the global namespace at the time the code is compiled. Try this:

  (mac my-mac () t)
  (def foo ()
     (my-mac))
  (mac my-mac () nil)
  (foo)
or even this:

  (def foo ()
    (my-mac))
  (mac my-mac ()
       '(prn "foo!"))
  (foo) ;this last will also depend on whether you're on Anarki or ArcN
So basically, you need to determine if a global would be assigned with a macro somehow, save the macro-code, and at macro-expansion time detect the macros and execute the macro-code in, say, an Arc interpreter.

-----

1 point by binx 5862 days ago | link

In principle, whether a global variable will be assigned with a macro is impossible to detemine. Arc doesn't distinguish the run-time global environment and the expand-time global environment, so you still have to carry the source information to keep the semantics of Arc-n...

-----

3 points by almkglor 5862 days ago | link

Yes you can - remember that macros must be assigned to global variables before they are used.

So what we do is, we use our own 'eval (this should be easy, this is a lisp after all). However, this 'eval's 'set must determine if it's assigning to a global or a local (again, this should be easy, since it has to know what to write where). Then it checks if a global assign is that of a macro.

For each top-level form, we 'eval the form and check if any global assign is a macro assign. Now we know if a macro has been assigned to a global. Then we compile the form.

The alternative is simply to check if a form has 'mac anywhere in it. If it has (mac ...) or (annotate 'mac ...) anywhere, we run this in 'eval instead of compiling that form.

Really, though, we don't need to keep the semantics of Arc completely. We can add a few rules for macro writers: Always use 'mac. Don't use anything other than the built-in functions, or if you really need a non-builtin function for your macro, then put it in a 'let form together with your macro, and put two copies of the 'let form.

Alternatively, we could just make a REPL for Arc, in Arc, write it in a form suitable for compilation, and then write the compiler in a form compatible with the REPL and arcN.

      ___
     v   \
  REPL    compiler
     \___^
Besides: having some macros is better than having no macros.

-----

1 point by binx 5862 days ago | link

You can't eval each top-level form in compile-time. They may have side-effects, may expect user inputs, and may terminate in weeks(for example, a program doing ray-tracing or nuclear-explosion simulation).

Explicitely distinguish expand-time and run-time is necessary for real static compilation. If you don't want to distinguish them, you are inevitably forced to do compilation in run-time.

-----

1 point by almkglor 5862 days ago | link

> You can't eval each top-level form in compile-time. They may have side-effects, may expect user inputs, and may terminate in weeks(for example, a program doing ray-tracing or nuclear-explosion simulation).

And you're compiling (not executing) a top-level form that does those things?

All right then - make the limitation that macros must be defined in their own top-level forms, and if a top-level form ever contains 'mac in a function position, or (annotate 'mac ...), then it's executed and any macros it assigns to globals are treated as such: macros. If it terminates in weeks, too bad.

Alternatively just write an interpreted REPL and include the ability to compile code, but not include macros in compilation - macros must be loaded into the REPL, then the actual runnable code is compiled (and the compiled code's macros are ignored).

Anything just to get macros.

-----

1 point by sacado 5862 days ago | link

Actually, GC is more complicated than I thought : references to objects are not always pointers (i.e., addresses to something allocated on the heap). Fixnums, for example, but also t and nil, are a special kind of references (for fixnums, the last bit is 0, indicating "the actual value of the x reference is x >> 1". And these values have to be collected from time to time. The fib example runs out of memory quite fast, without a malloc since it's only using fixnums.

I started writing a naïve one yesterday (you know, the kind of things that were state of the art 50 years ago) and it's a little easier than I expected. At least, it doesn't core dump anymore. I'm in infinite loops now :)

-----

2 points by binx 5862 days ago | link

Boehm's GC is conservative, which means it treats all the values on stack as pointers. So there may be some space leak, but it's acceptable. At least many compilers-to-C use boehm's GC, and many of them are in practical use now.

-----

2 points by sacado 5862 days ago | link

Yes, but the problem is that even the references that are actual pointers do not really hold a pointer address, as this address should not end with a 0. You would do something like :

  ref = (malloc (sizeof (cons-cell)) << 1) + 1;
That way, ref knows it is pointing something (as ref & 1 == 1), it knows the address to this thing (it is ref >> 1), but this is totally invisible to Boehm's GC, I suppose. I guess it would think "hmm, this address 01001000 does not seem to be pointed by anyone. The only pointer I can see around is 10010001. Not the same thing obviously. Let's free 01001000 !"

But maybe I'm missing something, I never used Boehm except in toy apps.

-----

3 points by binx 5862 days ago | link

Oh, in order to use boehm's GC, you should store references as normal pointers. And you have to tag the lowest two bits of fixnums and immediate consts with 01/10/11.

Another solution is to write your own object memory, allocation functions and GC yourself.

-----

3 points by almkglor 5862 days ago | link

This is true; we'll have to change tags so that real objects are actual pointers. The alternative is to make everything a pointer, i.e. even "numbers" are really pointers to numbers.

I think it's easier to swap the meaning of numbers with objects rather than build our own GC.

As for infinite loops in GC - do you make sure not to scan a memory area you've already scanned?

-----

2 points by binx 5862 days ago | link

Yes, and if he wants to write a GC, I advise him to start with a semi-space copying GC, which is simpler and has no recursions. By this you don't have to preserve an extra bit for mark-and-sweep, and you need no GC stacks for deep-first traversing.

-----

1 point by sacado 5862 days ago | link

I don't really like making everything a pointer. I don't know any Lisp implementation working this way (even the slowest ones). It really implies a performance hit every time you use a small integer value (with -2G < small < 2G). Anyway, binx's idea seems to be good...

Well, about infinite loops, I don't know, I stopped yesterday when my head was starting to burn... And gcc always compaining about pointers being of the wrong type (the way it is implemented, with tagged references, I have to cheat a lot). Oh, come on, gcc, who do you think you are, an Ada compiler ?

-----

4 points by binx 5862 days ago | link

You don't need to make everything a pointer, but you can make everything a structure of 12 or 16 bytes. Lua takes this approach and it performs rather well.

Here's the Lua way:

struct obj{ int type; union data{ double num; void *ptr; }; }

By this means, you can achieve architecture independence.

-----

3 points by almkglor 5862 days ago | link

This approach is the best. In fact this has to be done, because not all computers have sizeof(int) == sizeof(void* ); certainly my AMD64 running on a 64-bit kernel doesn't.

Note that we can actually abstract away the representation of objects from the actual binary bits that represent it. For instance I would recommend that objects use the following representation (as noted by binx):

  typedef struct s_obj{
    enum e_type {
      other = 0,
      number = 1,
      symbol = 2,
      character = 3
    }
    type;
    union u_data{
      void* other;
      int number;
      void* symbol;
      Uchar character; //unicode character
    } data;
  } obj;
  #define OBJ_TYPE(o) ((o).type)
  #define OBJ2FIXNUM(o) ((o).data.number)
  #define FIXNUM_MIN INT_MIN //requires glibc
  #define FIXNUM_MAX INT_MAX
  //others defined similarly
However, for machines with the following characteristics:

1. 32-bit pointers and 32-bit int's

2. malloc() always returns an address at a 32-bit boundary (i.e. every four bytes, or with the lowest two bits zero)

We can then use:

  typedef int obj;
  #define OBJ_TYPE(o)  ((o) & 0x03)
  #define OBJ2FIXNUM(o)  ((o) >> 2)
  #define FIXNUM_MIN (-(1 << 30))
  #define FIXNUM_MAX ((1 << 30) - 1)
Thus we are able to have portability while still gaining optimizations for some architectures.

-----

1 point by sacado 5861 days ago | link

In fact, maybe it's not a problem even on machines where ints are not as big as pointers. You can treat each value as an int pointer, and, when you need a fixnum, do

  (int) my_int_ptr << 1
If ints are bigger than pointers, that's okay : you loose possible bits, but at least it's working (for example, if ints are on 6 bytes and addresses on 4 bytes, only the 4 least significant bytes are useful, the 2 other ones are lost). If pointers are bigger than ints, it's still working, but this time lost space is in their pointer representation.

The only thing is that fixnums don't have an architecture-independant size, but that's not a problem in practice anyway. The only constraint is that the biggest fixnum is

  2**(max(sizeof(int), sizeof(int*)) - 2)
The last-bit-is-1-for-fixnum trick works on every architecture except machines where addresses are on 8 bits. Well, maybe we can ignore these ?

-----

3 points by almkglor 5861 days ago | link

Actually it is, arc2c output crashes on my AMD64. I'm trying to hack out a replacement, but it's not working yet.

The problem is that apparently the bits of pointers that happen to be beyond the bits of ints are not zero, so assigning a pointer to an int chops off the significant bits.

-----

1 point by sacado 5860 days ago | link

Oh... Maybe a simple fix would be to change int to long. I know this is not standard behavior, but in practice I think long and pointers are always the same size. mzscheme obviously works that way : every object is a pointer that you can cast to a long if it is a fixnum.

-----

2 points by almkglor 5860 days ago | link

This seems to be correct on my AMD64:

  #include<stdio.h>
  
  int main(void){
        printf("sizeof(int) = %d\n", sizeof(int));
        printf("sizeof(long) = %d\n", sizeof(long));
        printf("sizeof(void*) = %d\n", sizeof(void*));
        return 0;
  }

  sizeof(int) = 4
  sizeof(long) = 8
  sizeof(void*) = 8
However I think there may be processors/archs where sizeof(long) != sizeof(void* ). My attempt at fixing was to use a union, but something's wrong with the way closures are handled in my fix attempt - closures don't seem to have a type associated with them, so I'm not exactly sure how they're supposed to be done.

Edit: I'll probably need to search through C language specs, though - I'm not sure, but it might be standardized that sizeof(long) >= sizeof(void* )

-----

2 points by absz 5860 days ago | link

What you want are the two typedefs intptr_t and uintptr_t . Each one is defined to be an integer big enough to hold a pointer (the former being signed, and the latter being unsigned); they aren't mandatory, though, so you might have

  #if defined(__COMPILER1__) || defined(__COMPILER2__)
    typedef intptr_t fixnum;
  #else
    typedef long fixnum;
  #endif
This would guarantee you correct behaviour if intptr_t were defined, and probably give you the correct behaviour even if it weren't.

-----

1 point by almkglor 5860 days ago | link

From this?

  #include<stdint.h>
Hmm. Interesting. How well supported is it?

-----

1 point by sacado 5860 days ago | link

Thanks for the information ! I'll try it.

-----

1 point by stefano 5862 days ago | link

Making a fixnum a pointer to a allocated memory containing the value is terribly slow.

The infinite loop could be caused by two objects that references each other, e.g. a references b and b references a, so the GC keeps going from a to b and from b to a. To solve the problem, if this is the problem, if an object has been already marked, don't follow it.

Note: I've not read the source yet.

-----

1 point by sacado 5862 days ago | link

Hmm... That would work because addresses are on 4 bytes, so they always end with 00, right ? That's a nice idea :) But it wouldn't work on an architecture with addresses on less than 4 bytes, no ? There are not many nowadays, I guess, but I suppose that should be added as an assertion is the code...

-----

0 points by stefano 5862 days ago | link

You have to be sure that every allocated object is at 8 byte boundaries for the pointer to end with 00.

-----

1 point by ambi 5853 days ago | link

Do you know ALL_INTERIOR_POINTERS? In gc/doc/README:

"Any objects not intended to be collected must be pointed to either from other such accessible objects, or from the registers, stack, data, or statically allocated bss segments. Pointers from the stack or registers may point to anywhere inside an object. The same is true for heap pointers if the collector is compiled with ALL_INTERIOR_POINTERS defined, or GC_all_interior_pointers is otherwise set, as is now the default."

-----

2 points by eds 5864 days ago | link

If you used something like CLN (http://www.ginac.de/CLN/) then you could get the full numeric tower basically for free. But perhaps you'd rather implement numbers in Arc instead.

As you probably know, I submitted a GSoC proposal for CL-Arc. I could write up a quick proposal for Arc-C, but there are only two more days before applications are due. Also, someone from the Arc community would need to apply to mentor the project in order for it to be funded.

-----

1 point by almkglor 5863 days ago | link

I've taken a look at CLN and it seems to be a C++ library. I suppose this means we need to create some sort of set of wrapper functions and wrapper data types to handle this for us.

Also since allocation is all from the heap it seems we can use Boehm GC. Limited stack allocations (for leaf functions) would be cool too ^^. Edit: Oops. Since we don't actually end the function until at HALT(), it seems we can't free memory allocated via alloca().

-----

1 point by sacado 5864 days ago | link

What does it mean to be a mentor ? I mean : what are the constraints ? Maybe I could mentor you if you like (and if I can apply as a mentor ) ?

-----

2 points by eds 5864 days ago | link

From Google's FAQ:

"Google does not have specific eligibility requirements for mentors, as we know our mentoring organizations will be best able to determine the selection criteria for their mentors."

I don't know if LispNYC has any specific requirements.

As a mentor, I think you have to be in email contact with your student, give them advice as necessary, and evaluate their work. I believe I read somewhere that they estimate mentoring will take about 5 hours a week depending on the number and difficulty of projects.

There is a mentor signup link on LispNYC's page http://lispnyc.org/soc2008.clp. Google has some information about using the mentoring web app at http://groups.google.com/group/google-summer-of-code-announc....

-----

3 points by almkglor 5863 days ago | link

If there's no need to be physically there, I'm willing to mentor you, in case sacado is somehow disqualified or is otherwise unable to mentor you (I'll probably ask sacado to unofficially mentor my mentoring you, though, sort of a meta-mentor). Anyway I've just applied, although I'll gladly defer to sacado (it'll be mostly his code anyway) if he is able to mentor you.

Note that due to various circumstances, I won't be able to leave my country until about a year or so, so if my physical presence is needed, sorry!

-----

1 point by eds 5863 days ago | link

Thanks (although it looks like sacado did sign up). And I am pretty sure there is no requirement to be physically anywhere in particular.

-----

1 point by eds 5864 days ago | link

Ok, I submitted my proposal to GSoC. If you'd like to mentor, I would appreciate it. (Otherwise I can't get funded.) And please note that if you want to mentor, you need to apply today, or tomorrow (March 31) before 5:00 PM PDT.

P.S. if you would like to read the proposal, I have posted it up at http://blackthorncentral.net/node/49.

-----

3 points by sacado 5863 days ago | link

Done. I submitted as a mentor from Google's form. I'm waiting for my subscription to be validated now.

-----

1 point by eds 5863 days ago | link

Thanks!

P.S. As it happens, Google just extended the deadline to April 7th....

-----

3 points by binx 5862 days ago | link

For bignums, just look at the url:

http://gmplib.org/

-----

1 point by almkglor 5863 days ago | link

Come to think of it, it seems that collections-in-function-position would make function calls difficult and/or slightly slower. And then there's the annotate thingy.

Hmm. I think to properly support annotate we really need to attach two types to the object: the real type of the representation and the user-declared type of the representation.

-----

3 points by stefano 5863 days ago | link

Attaching too much information to an object leads to serious problems. Think about fixnums: they're usually implemented using 2 bits out of 32 as a type tag. This way you don't have to allocate them on the heap. If you attach an extra information you have to put everything on the heap (very slow for fixnums).

As of collections-in-function-position instead of returningm for example, and hash-table one could return a fuction that acts as an interface towards the real hash-table.

  (def table () 
    (let tb (make-real-table)
      (fn (key) (gethash tb key))))
This doesn't handle setting values, but I think that a few macros could solve the problem.

-----

2 points by almkglor 5863 days ago | link

> Attaching too much information to an object leads to serious problems. Think about fixnums: they're usually implemented using 2 bits out of 32 as a type tag. This way you don't have to allocate them on the heap. If you attach an extra information you have to put everything on the heap (very slow for fixnums).

Then let's not attach both types to the same object. pg's implementation creates a new object and attaches the type to that object, leaving the original representation object (type id and all) untouched. Basically we have an annotation type which just contains the attached type and the original object, leaving the type id bits untouched.

> This doesn't handle setting values, but I think that a few macros could solve the problem.

Looks like a job for settable-fn ^^

-----

3 points by sacado 5863 days ago | link

Yes. I implemented fixnums with the last bit as a tag-bit. If it's 0, everything else is a fixnum. If not, well, it's something else... Maybe nil, t or a reference to something else.

Your idea for callable collections seems quite good, too. But I'm not there yet.

-----

1 point by stefano 5863 days ago | link

Wow. I can't wait. I'm also writing a compiler following this tutorial: http://www.cs.indiana.edu/~aghuloum/compilers-tutorial-2006-... but it's just an exercise.

-----