You're welcome. You can prolly review bits of the code on GitHub if you don't have access to your own computer this week.
Wonder how eds is doing on arc2c? BTW have you requested to mentor his GSoC application? If you already did I'll withdraw my request.
The question about strings is: how do we represent them? UTF-8? UTF-32? As an array or a list of characters?
Arc's underlying mzscheme divides strings into code points; each code point is representable by a single 32-bit number (I think). An individual "character" in mzscheme is thus a code point (from what I gather), although in Unicode a character could be represented by several code points (or so I hear).
Now the point is that the following is quite valid:
arc> (= p "asdf")
arc> (= (p 0) #\c)
So obviously access to individual characters should be easy, and replacing an individual character should not force us to shuffle memory around. That all but rules out UTF-8 as the in-memory format, since a replacement character can need a different number of bytes than the one it replaces, even though pretty much all I/O will use UTF-8.
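To make that concrete, here's a rough C sketch of what a fixed-width string could look like in the runtime. None of this is arc2c's actual code: arc_string, arc_string_from_ascii, arc_string_set and utf8_len are names I made up for illustration. The idea is just that (= (p 0) #\c) becomes a single store into an array of 32-bit slots, while utf8_len shows why the same assignment on a UTF-8 buffer could change the byte length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-width string: one 32-bit slot per code point. */
typedef struct {
    size_t    len;  /* number of code points */
    uint32_t *cp;   /* the code points themselves, UTF-32 style */
} arc_string;

/* Build a string from plain ASCII text (enough for this sketch). */
static arc_string *arc_string_from_ascii(const char *s, size_t len)
{
    arc_string *str = malloc(sizeof *str);
    if (str == NULL)
        return NULL;
    str->cp = malloc((len ? len : 1) * sizeof *str->cp);
    if (str->cp == NULL) {
        free(str);
        return NULL;
    }
    str->len = len;
    for (size_t i = 0; i < len; i++)
        str->cp[i] = (unsigned char)s[i];
    return str;
}

/* (= (p i) c): overwrite one slot in place; nothing else moves. */
static void arc_string_set(arc_string *str, size_t i, uint32_t c)
{
    assert(i < str->len);
    str->cp[i] = c;
}

/* Bytes a code point would need in UTF-8, or 0 if it is not encodable
   (a surrogate value or anything past U+10FFFF). */
static size_t utf8_len(uint32_t c)
{
    if (c < 0x80)                 return 1;
    if (c < 0x800)                return 2;
    if (c >= 0xD800 && c <= 0xDFFF) return 0;
    if (c < 0x10000)              return 3;
    if (c < 0x110000)             return 4;
    return 0;
}
```

With this layout, replacing #\a (1 byte in UTF-8) with a lambda (2 bytes in UTF-8) is still one store, whereas a UTF-8 buffer would have to grow by a byte and shift everything after it. The price is 4 bytes per character.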
Yes, I think UTF-8 would be a disaster with modifiable strings. mzscheme uses UCS-4 (UTF-32) internally, and that would be the simplest approach. If you are willing to ignore Unicode code points above 65535 (0xFFFF), then UCS-2 would be okay with half the memory usage. When you talk about a character represented by several code points, are you talking about Unicode surrogates for code points above 65535? (Oversimplifying, two UTF-16 surrogate code units are used to represent one Unicode code point above 65535.) I think you'd be better off with UTF-32 than UTF-16 and surrogates, as surrogates look like a nightmare that you'd only want if you need backwards compatibility; see Java's character support: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character....
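For anyone who hasn't looked at surrogates, this is roughly the arithmetic involved, sketched in C. The function names utf16_encode and utf16_decode_pair are mine, not from any library; the constants are the standard UTF-16 ones. It shows why indexing gets awkward: one code point turns into either one or two 16-bit units, so the n-th character is no longer at offset n.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if the
   value is not a valid code point for encoding. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* range reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* fits in a single unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                      /* past the end of Unicode */
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}

/* Combine a high/low surrogate pair back into one code point. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

With UTF-32 none of this exists in the runtime: every slot is a whole code point and indexing is plain array access.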
Oh, Unicode combining characters and normalization. I classify that as "somebody else's problem." Specifically, if you're writing a font rendering engine, it's your problem. If you're writing an Arc compiler, it's not your problem. If you want complete Unicode library support in your language (like MzScheme's normalization functions string-normalize-nfd, etc.), then you just use an existing library such as ICU, and it's not your problem. ICU: http://www-306.ibm.com/software/globalization/icu/index.jsp
The current arc2c output assumes that you have a proper Boehm GC installation, but since I can't seem to get a good install here (prolly something to do with being AMD64 again) I just disable the GC for now.
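If anyone else wants to do the same, one way to make the GC optional is a small shim like the one below. This is only a sketch and not how arc2c's generated code is actually laid out: ARC_USE_BOEHM, ARC_GC_INIT, ARC_ALLOC, arc_cons and arc_make_cons are all hypothetical names. The only real API here is GC_INIT and GC_MALLOC from the Boehm collector's gc.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef ARC_USE_BOEHM              /* hypothetical build flag */
#include <gc.h>                   /* Boehm collector header */
#define ARC_GC_INIT()  GC_INIT()
#define ARC_ALLOC(n)   GC_MALLOC(n)
#else
/* GC disabled: fall back to malloc and simply never free. */
#define ARC_GC_INIT()  ((void)0)
#define ARC_ALLOC(n)   malloc(n)
#endif

/* Hypothetical cons cell, just to have something to allocate. */
typedef struct arc_cons {
    void *car;
    void *cdr;
} arc_cons;

static arc_cons *arc_make_cons(void *car, void *cdr)
{
    arc_cons *c = ARC_ALLOC(sizeof *c);
    assert(c != NULL);
    c->car = car;
    c->cdr = cdr;
    return c;
}
```

Built without the flag this leaks everything, which is fine for short test programs. On a machine with a working install you would compile with -DARC_USE_BOEHM and link against the collector (typically -lgc).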
I finally got arc2c to work (without GC as you suggested). One note though: apparently arc2c relies on rm-global.arc, but doesn't load it by default. So under the current version of the compiler, 'compile-file will error until you (load "rm-global.arc").