Arc Forum | Characterset; MzScheme gives you UTF-8 anyway?

	Characterset; MzScheme gives you UTF-8 anyway?
	8 points by chrisdone 6365 days ago | 14 comments
	I find it strange that Arc is written in MzScheme, yet is not supposed to support UTF-8. Surely that would be done for you? Can you explain this, Paul?

2 points by ryszard_szopa 6364 days ago | link

At this moment Unicode is an implementation detail rather than a language feature.

I must say it: not supporting Unicode (or explicitly planning not to support it) is a BAD thing. You will hardly notice it if you come from the US. It may get a bit tricky if you come from the UK, as you may want to use the pound or euro symbol. If you come from a diacritics-rich language, then you may start feeling stupid. Prepare to serve yourself and your users messages like:

"Sarra, thas cammanacata has baan adaptad ta tha fana pragrammang langaaga wa ara asang." ("Sorry, this communicate has been adapted to the fine programming language we are using."---it is not that hard to guess after all, is it?)

No, PG, please don't be that guy.

Python's Unicode support sucked badly at the beginning, but they kept improving it. Right now it is kinda acceptable (though I regularly spend some time debugging Unicode errors---you'd imagine by now I would be used to it), and in Py3k it will hopefully be made right. Ruby's Unicode support still sucks, and that is basically why I don't use it (even though I like its semantics a lot, as it is more lispy than Python). Not being able to divide a word from your own language into three-character substrings (because non-ASCII characters take more than one byte in UTF-8) is plainly ridiculous... even at the prototype level.
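
To make the substring complaint concrete, here is a minimal sketch in Python 3, used purely as a stand-in (it is not Arc, and the word and the helper names `chunks` and `decodes_cleanly` are illustrative assumptions). It splits the same word into pieces of three, once by character and once by UTF-8 byte:

```python
# Illustrative sketch only: Python 3 stands in for any language whose
# strings index by code point. The word and helper names are assumptions.
word = "\u017c\u00f3\u0142\u0107"  # "żółć", a Polish word with no ASCII letters


def chunks(seq, size):
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def decodes_cleanly(raw):
    """Return True if `raw` is valid UTF-8 on its own."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


by_chars = chunks(word, 3)                  # slices between whole characters
by_bytes = chunks(word.encode("utf-8"), 3)  # slices may land inside a character

print(by_chars)
print([decodes_cleanly(piece) for piece in by_bytes])
```

Chunking by character yields readable substrings, while chunking the encoded bytes can cut a two-byte character in half, leaving pieces that are no longer valid text.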

Of course, no one says it has to be done right now. But I'd like to know it is in the plans.

-----

1 point by mdemare 6364 days ago | link

Ruby's Unicode support is acceptable in 1.8, and good in 1.9. I'm not asking for the world; I just want strings to be able to contain text in any encoding, and to be able to split a string into chars, given an encoding.

-----

1 point by immanuel 6364 days ago | link

I would like to use numbers in various encodings, like reversed (big-endian on little-endian machines and vice versa). I also want the language to natively support all these number encodings and to be able to add two numbers, given their encodings.

-----

2 points by nex3 6365 days ago | link

Arc actually does support UTF-8.

  arc> ("uber" 0)
  #\u

I imagine it only officially supports ASCII because it will be migrated away from MzScheme eventually.

Note: Those "u"s are supposed to have umlauts, but that's apparently normalized away somewhere. The point is, u with an umlaut is treated as a single character by the current implementation.

-----

1 point by mascarenhas 6365 days ago | link

Well, indexing will most certainly break, but making an encoding-agnostic reader/writer is easy. I hope PG does that when/if Arc goes standalone.

-----

1 point by nex3 6365 days ago | link

I'm sure it'll be agnostic, if by "agnostic" you mean that it just reads in strings as a sequence of bytes. It would be easier to do that than to check for non-ASCII characters and handle them specially.
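
A rough sketch of what "agnostic" in this byte-sequence sense would mean, written in Python 3 as a stand-in (this is not how Arc or MzScheme reads strings; `copy_bytes` is a hypothetical helper). The pass-through never decodes, so it works for any encoding, but indexing and length then refer to bytes rather than characters:

```python
import io


def copy_bytes(reader, writer):
    """Encoding-agnostic pass-through: it never decodes, so it cannot mangle text."""
    writer.write(reader.read())


original = "\u00fcber"  # "über"
results = {}
for encoding in ("utf-8", "latin-1"):
    raw = original.encode(encoding)
    out = io.BytesIO()
    copy_bytes(io.BytesIO(raw), out)
    # Record: did the bytes survive unchanged, how long is the "string",
    # and what does index 0 hold when the unit is a byte?
    results[encoding] = (out.getvalue() == raw, len(raw), raw[0:1])

for encoding, summary in results.items():
    print(encoding, summary)
```

The round trip is byte-exact in both encodings, yet the same four-letter word has a different length in each, and in UTF-8 the first element is only half of the first letter.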

-----

3 points by will 6365 days ago | link

  arc> (def sanskrit () "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥")
  #<procedure: sanskrit>
  arc> (sanskrit)
  "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥"
  arc> ((sanskrit) 0)
  #\क
  arc> ((sanskrit) 2)
  #\च
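
The indices in the transcript above are consistent with indexing by Unicode code point: index 2 is the second consonant because index 1 is a vowel sign attached to the first one. A small Python 3 sketch (a stand-in, not the Arc or MzScheme implementation) lists the code points of the first word:

```python
import unicodedata

# First word of the sentence quoted above, spelled with escapes so the
# exact code points are unambiguous.
text = "\u0915\u093e\u091a\u0902"

rows = [
    (index, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    for index, ch in enumerate(text)
]

for row in rows:
    print(*row)
```

Categories starting with `M` are combining marks, so a code point index is still not the same thing as what a reader would count as a letter.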

-----

2 points by will 6365 days ago | link

Well ... things look good in the REPL, but not so good in the forum. Sorry for the crud.

-----

1 point by asolove 6364 days ago | link

The real question is: how many people will repeat the ASCII business without checking, when the thing supports UTF-8 just fine out of the box? (I'm working on a project in Chinese right now.)

-----

3 points by mdemare 6364 days ago | link

ASCII-only support is explicitly in the release notes, and it scares me. But I'm glad that UTF-8 happens to work right now.

-----

1 point by gregwebs 6365 days ago | link

Did you read the announcement? http://paulgraham.com/arc0.html

-----

6 points by mdemare 6364 days ago | link

I read his announcement and I completely disagree. Strings are pretty basic, and getting them right is part of the work of a language designer. They're more important than macros. Not getting strings right can cripple a language.

And to call not supporting Unicode "offensive" is missing the point. Only supporting ASCII makes the language less powerful. It means you can't use Arc for solving problems involving text manipulation in languages other than English. That's a big space. Only supporting UTF-8 would make more sense.

And for all the Java bashing nowadays, Java got Strings right, and Perl, Python, PHP and Ruby didn't.

-----

1 point by Elfan 6364 days ago | link

Java's Unicode support has historically been a mess too. They assumed that 16 bits would always and forever be enough for any code point. This was only "fixed" in 2004 and the warts are still there.

I suppose the lesson to take away is that just about every single language has messed up character sets. So it can't be a fatal mistake, but it certainly isn't one that makes any sense to repeat.
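
The 16-bit problem can be sketched without Java. In the Python 3 snippet below (a stand-in; `utf16_units` is a hypothetical helper), a character outside the Basic Multilingual Plane needs two 16-bit UTF-16 code units, a surrogate pair, which is why a fixed 16-bit character type cannot hold every code point:

```python
def utf16_units(s):
    """Return the 16-bit code units of `s` when encoded as UTF-16."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


bmp_char = "\u00fc"          # ü, fits in a single 16-bit unit
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, beyond U+FFFF

print(len(bmp_char), [hex(u) for u in utf16_units(bmp_char)])
print(len(astral_char), [hex(u) for u in utf16_units(astral_char)])
```

Both strings are one code point long, but the second one occupies two code units, so any API that counts or indexes 16-bit units will see it as two "characters".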

-----

2 points by ank 6364 days ago | link

I can't believe he said that. At this stage no one really expected Arc to have any sort of UTF/Unicode/I18N support. He should have kept that to himself, and then the users would have built the libraries on top of Arc. Well, I guess time will tell. Will keep an eye on it.

http://fixingsoftware.blogspot.com/2008/01/arc-has-been-rele...

-----