Arc Forum

1 point by sacado 6347 days ago | link | parent

Well, I'm amazed. Thanks for your involvement in that project! As for strings & unicode, I guess there are good existing libraries.

1 point by almkglor 6347 days ago | link

You're welcome. You can prolly review bits of the code via github if you don't have access to your own computer this week.

Wonder how eds is doing on arc2c? BTW have you requested to mentor his GSoC application? If you already did I'll withdraw my request.

The bit about strings is - how do we represent them? UTF-8? UTF-32? As an array or list of characters?

Arc's underlying mzscheme divides strings into code points; each code point is representable by a single 32-bit number (I think). An individual "character" in mzscheme is thus a code point (from what I gather), although in Unicode a character could be represented by several code points (or so I hear).

Now the point is that the following is quite valid:

  arc> (= p "asdf")
  "asdf"
  arc> (= (p 0) #\c)
  #\c
  arc> p
  "csdf"

So obviously access to individual characters should be easy, and replacing individual characters should probably not cause us to mess too much with memory. This almost prevents the use of UTF-8, even though all I/O will pretty much just use UTF-8.
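To make the memory concern concrete, here is a rough C sketch. It is not arc2c's actual string representation, and `utf8_len`, `utf32_set` and `utf8_replace_shift` are made-up names for illustration. With 32-bit code points, replacing a character is one array store. With UTF-8, the new character may need a different number of bytes than the old one, so the rest of the string has to move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point (1 to 4). */
static size_t utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* UTF-32: (= (p i) c) becomes a single store, whatever the character is. */
static void utf32_set(uint32_t *s, size_t i, uint32_t c)
{
    s[i] = c;
}

/* UTF-8: how many bytes the tail of the string must shift by when
   old_cp is replaced with new_cp. Anything nonzero means a memmove,
   and a positive value may also mean growing the buffer. */
static ptrdiff_t utf8_replace_shift(uint32_t old_cp, uint32_t new_cp)
{
    return (ptrdiff_t)utf8_len(new_cp) - (ptrdiff_t)utf8_len(old_cp);
}
```

Replacing #\a with #\c shifts nothing, but replacing #\a with U+00E9 shifts the tail by one byte. On top of that, finding the i-th character in UTF-8 needs a scan from the start, since characters are not a fixed width.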

-----

2 points by kens 6347 days ago | link

Yes, I think UTF-8 would be a disaster with modifiable strings. mzscheme uses UCS-4 (UTF-32) internally, and that would be the simplest approach. If you are willing to ignore Unicode code points above U+FFFF (65535), then UCS-2 would be okay with half the memory usage. When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters > 65536? (Oversimplifying, two UTF-16 surrogate code units are used to represent one Unicode code point above U+FFFF.) I think you'd be better off with UTF-32 than UTF-16 and surrogates, as surrogates look like a nightmare that you'd only want if you need backwards compatibility; see Java's character support: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character....
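For reference, a minimal C sketch of how one code point turns into UTF-16 code units. The function name `utf16_encode` is made up here; the arithmetic is the standard UTF-16 surrogate scheme. It shows why a 16-bit representation stops being "one array slot per character" once code points above U+FFFF are allowed.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogate values are not characters */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+0041 stays one unit, while U+1F600 comes out as the pair 0xD83D 0xDE00, so indexing and in-place replacement hit the same variable-width problem as UTF-8.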

-----

1 point by almkglor 6347 days ago | link

> When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters > 65536?

Actually I'm talking about so-called "combining characters" http://en.wikipedia.org/wiki/Combining_character

Normalization... hahahaha unicode unicode headaches headaches! http://en.wikipedia.org/wiki/Unicode_normalization
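A small C sketch of what that means in practice. It is deliberately crude: `is_combining_mark` is a made-up helper that only knows the Combining Diacritical Marks block (U+0300 to U+036F), whereas a real implementation would consult the Unicode character database. The point is just that "é" can be one code point (U+00E9) or two (U+0065 followed by U+0301), so counting code points is not the same as counting characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Crude check: only covers the Combining Diacritical Marks block. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count code points that are not combining marks from that block,
   as a rough stand-in for "characters as the user sees them". */
static size_t count_base_chars(const uint32_t *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!is_combining_mark(s[i]))
            count++;
    }
    return count;
}
```

Both {0x00E9} and {0x0065, 0x0301} count as one base character here, even though the second sequence is two code points long. Normalization is what decides which of the two forms a string is stored in.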

-----

4 points by kens2 6346 days ago | link

Oh, Unicode combining characters and normalization. I classify that as "somebody else's problem." Specifically, if you're writing a font rendering engine, it's your problem. If you're writing an Arc compiler, it's not your problem. If you want complete Unicode library support in your language (like MzScheme's normalization functions string-normalize-nfd, etc.), then you just use an existing library such as ICU, and it's not your problem. ICU: http://www-306.ibm.com/software/globalization/icu/index.jsp

-----

1 point by eds 6346 days ago | link

> Wonder how eds is doing on arc2c?

I've been following up on the forum threads but I haven't had time to actually read the code yet. (And the last time I checked, the arc2c executable gave me a segfault.)

-----

1 point by almkglor 6346 days ago | link

LOL. In any case, to reduce the possibility of things being screwy, I do the following on my C output:

  //#include<gc.h>
  #define GC_MALLOC malloc
  #define GC_INIT()

The current arc2c output assumes that you have a proper Boehm GC installation, but since I can't seem to get a good install here (prolly something to do with being AMD64 again) I just disable the GC for now.
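The same idea could be made switchable at compile time instead of hand-editing the output each time. This is only a sketch: `USE_BOEHM_GC` and `alloc_bytes` are hypothetical names, not something arc2c defines. `GC_MALLOC` and `GC_INIT` are the usual Boehm GC entry points from `gc.h`.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef USE_BOEHM_GC              /* hypothetical flag, e.g. cc -DUSE_BOEHM_GC ... -lgc */
#include <gc.h>
#else
#define GC_MALLOC(n) malloc(n)   /* no collector: memory is simply never freed */
#define GC_INIT()    ((void)0)   /* nothing to initialize without the GC */
#endif

/* Hypothetical helper: allocate n bytes through whichever allocator
   was selected above, and stop if the allocation fails. */
static void *alloc_bytes(size_t n)
{
    void *p = GC_MALLOC(n);
    assert(p != NULL);
    return p;
}
```

Without the flag this behaves like the three-line hack above, leaking everything, which is fine for short test programs but not for anything long-running.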

Hmm, can you try on a later version?

-----

2 points by eds 6346 days ago | link

I finally got arc2c to work (without GC as you suggested). One note though: apparently arc2c relies on rm-global.arc, but doesn't load it by default. So under the current version of the compiler, 'compile-file will error until you (load "rm-global.arc").

Now I just have to get it to work with GC...

-----

2 points by almkglor 6346 days ago | link

Oops. Must have forgotten to add it to arc2c.arc then ^^. Unfortunately I won't be able to fix this until maybe 7 hours from now T.T, haha, I'm in the office ^^

The thing about GC working: well, you need to somehow download the development version of Boehm GC, and, well, that's what's stopping me for now T.T

-----

1 point by eds 6346 days ago | link

Are your latest changes on Anarki? Even after "git pull" I still seem to have the old version.

-----