Arc Forum
Unicode a matter of political (in)correctness?
15 points by fauigerzigerk 6141 days ago | 20 comments
There seems to be a (probably innocent) misunderstanding about the role of diacritics in languages. Peking and Beijing differ in spelling but not in meaning (at least not in most contexts). Anyone who can put their emotions aside for a moment understands what either one means. So yes, that part is a matter of political correctness.

However, there are many cases, across many languages, where diacritics change the meaning of a word or phrase. In Spanish, "Escribo un libro" means "I'm writing a book". With an accent on top of the o in "Escribo" it means "he (or she) wrote a book". Likewise, in German, "musste" with an umlaut means something different from the same word without the umlaut. So it's not a matter of political correctness, and not even just one of correct spelling. It's a matter of meaning, and it can become a legal issue if I'm unable to store a person's name correctly.

Is it politically correct to exclude the majority of internet users from my website? To be honest, I don't care. What I do care about is whether it's economically viable. And that question, for me, comes down to whether I ever plan to create a website for that other group of people who are not exclusively English-speaking, or who care about the correct spelling of their names.

And yes, I do feel bad that the first thing I have to say about a fantastic new programming language is about character sets. I hate character set issues, and I fully understand the sentiment of not wanting to spend even a single day on them. But I've been forced to deal with them many times, and I know all too well what happens if they're not considered from the outset. Make no mistake: there's no way whatsoever to avoid this issue, and any solution has to be part of the programming language itself.



4 points by eduardoflores 6141 days ago | link

Lack of Unicode may be an adoption stopper for many. I couldn't make a serious Spanish app in Arc without, maybe, some heavy workarounds; or even worse, try to make a multilanguage app.

BTW, actually, in Spanish "Escribo un libro" would be "I'm writing a book" while "Escribió un libro" would be "(he/she) wrote a book"; "Escribó" is an orthographic mistake. But you're right that the placement of the accent can change the meaning: "Yo cambie el significado" [no accent] means "I'll change the meaning" while "Yo cambié el significado" means "I changed the meaning". José is a male name while Jose [no accent] is a female name. Writing Mexico instead of México is also an orthographic mistake, while writing Espana instead of España is unthinkable. (And you can easily figure out the meaning of año (year) written with a plain 'n' :S )

-----

1 point by fauigerzigerk 6141 days ago | link

Thanks a lot for helping with my horrible Spanish! It shows that knowing a language is even harder than getting the character sets right. The "ano" thing is frightening. I'll never open my mouth again unless I know the language really, really well ;-)

-----

4 points by Xichekolas 6141 days ago | link

If you are scared to open your mouth and sound like a moron, you'll never get any practice at all.

Learning a second language is a lesson in humility. Try going to a Spanish café and asking for a 'bocadillo de polla' (instead of the correct 'bocadillo de pollo') ... that final 'a' is the difference between a 'dick sandwich' and what you most likely really wanted, a 'chicken sandwich'.

Had to learn that one the hard way myself. I'm sure the waiter in Sevilla is still laughing.

-----

4 points by jgrahamc 6141 days ago | link

No, open your mouth, make mistakes, that's the only way I truly became fluent in French.

-----

7 points by partdavid 6141 days ago | link

What I don't really understand is why strings are constrained at all... why aren't they just lists of integers? Graham himself in essays has suggested exactly this and other functional languages use it.

-----

2 points by serhei 6141 days ago | link

I'm guessing that pg wrote it to use lists, but it turned out too slow (he has a site to run using arc), so he's using regular strings for the time being. The first thing you do when writing programs in a language like Haskell that munge large amounts of text, for instance, is to stop using the built-in strings-as-lists and switch to some sensible library like ByteString.
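
For illustration, the strings-as-lists-of-integers representation being discussed might be sketched like this in Python (the function names are hypothetical, not Arc's actual API):

```python
# A string as a plain list of integer codepoints, and back again.
def string_to_list(s):
    return [ord(ch) for ch in s]

def list_to_string(cps):
    return "".join(chr(cp) for cp in cps)

cps = string_to_list("hola")
# Every string operation now becomes a generic list operation, which is
# elegant but costs far more memory and time than a packed representation
# like Haskell's ByteString.
assert list_to_string(cps) == "hola"
```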

-----

2 points by markhughes 6141 days ago | link

It's not "political correctness" to be a civilized, international human being, and support every language. It's an appalling display of barbarism, ignorance, closed-mindedness, and bigotry to claim that ASCII is all you need.

Guido had to spend a year on Unicode because Unicode is hard. But he spent that year on it because Unicode is essential to any modern programming language. Py3K is a massive improvement over Python 2 largely because it gets Unicode right.

Java is a massive improvement over C++ in many ways, and one of those ways is that it's a pure Unicode environment (Unicode has since moved beyond UTF-16 to support 32-bit code points, and Java has evolved to deal with those).

Arc is a failure.

-----

4 points by pg 6141 days ago | link

Think about this Mark: The Lisp in McCarthy's 1960 paper didn't deal with character sets. Why was that? Because he was a barbarian? Or because he was at a stage where he was dealing with other issues?

-----

7 points by icey 6141 days ago | link

Paul, I think a lot of the Arc emo-kids would be assuaged if you talked a little more about whether you outright reject the possibility of Unicode in Arc, or whether you just haven't gotten to it yet.

I mean, I think a lot of this would die down if you were to say something like "Look, I'm building a language, I haven't gotten to unicode. If someone builds a patch that works, I'd be happy to integrate it in". Or something like that.

I know it would make a difference to me, and it would let a lot of people look past the fact that it's missing _right now_.

-----

1 point by Xichekolas 6141 days ago | link

Isn't that what he pretty much said on his personal page? http://www.paulgraham.com/arc0.html

As I read it, character set support isn't that interesting to him right now, and he'd rather focus on other things. Everyone keeps treating it like this version of the language is the final one. Of course it's going to constantly gain new functionality.

-----

3 points by fauigerzigerk 6140 days ago | link

If that had been my impression I'd never have posted anything on this rather boring topic. What made me think it might be a permanent design decision is this extract from the Arc intro: "[...] it doesn't support any character sets except ascii. Such things may have their uses, but there's also a place for a language that skips them"

-----

1 point by icey 6140 days ago | link

The problem is that it's left open to interpretation. Sure, he said Arc is a work in progress, and everyone gets that. The problem is knowing what's on the table for change, and what's not.

-----

2 points by noahlt 6140 days ago | link

The Lisp in McCarthy's paper was a theoretical model for computation, not a language implementation made to build practical websites with.

I agree that you should get the other issues right first; they're more important in the long run. What annoys most people about the problem with character sets is that it means they can't go out and build the websites they want to using Arc yet.

-----

1 point by markhughes 6140 days ago | link

It's because McCarthy, like most people of the time, was a barbarian by any modern standard, and at that time, the users of the language were almost without exception white, middle-class, non-"immigrant" (in the last 100 years or so) Americans. ASCII was good enough for them. Character sets weren't expanded to deal with the other 6 billion people on the planet until later.

There's no point in using a language that's going to exclude 95% of humanity.

-----

3 points by eandjsfilmcrew 6140 days ago | link

> There's no point in using a language that's going to exclude 95% of humanity.

So true. I, for example, don't see the point in using French. :)

-----

1 point by palish 6140 days ago | link

So what exactly have you created lately?

It's way easier to be a critic than a maker. Also, read this: http://www.jwz.org/doc/worse-is-better.html

-----

1 point by papersmith 6141 days ago | link

Just curious, what issues would there be when transitioning to UTF-8? Isn't it backward compatible with ASCII?

-----

2 points by kmag 6140 days ago | link

The simple answer is "Yes, UTF-8 is backward compatible with ASCII, and the designers of UTF-8 were very clever to make the common use cases efficient and robust in the face of minor data corruption" ... but ...
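
The backward compatibility is easy to check. A Python sketch (illustrative only): any pure-ASCII text encodes to byte-for-byte identical output in UTF-8.

```python
s = "plain ASCII text"
# The UTF-8 encoding of ASCII text is identical to its ASCII encoding.
assert s.encode("utf-8") == s.encode("ascii")
# Non-ASCII characters are where the encodings diverge:
assert "ñ".encode("utf-8") == b"\xc3\xb1"   # 'ñ' becomes two bytes in UTF-8
```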

For a start, the substring and string-length code would need to be rewritten. If a lot of substring operations were to be performed, you'd probably want to lazily construct skip lists or some other data structure to memoize the character/codepoint indexing operations.
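
Python's own str type (which went through this transition) shows why byte-oriented length and indexing break on UTF-8 data (illustrative sketch):

```python
s = "España"
b = s.encode("utf-8")
assert len(s) == 6   # six codepoints
assert len(b) == 7   # seven bytes: 'ñ' occupies two bytes in UTF-8
# Byte-oriented substring code can split the 'ñ' in half:
assert b[:5].decode("utf-8", errors="replace") != s[:5]
```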

Unicode has a notion of "codepoints" that correspond roughly with what you probably think of as characters. However, what you think of as a single character may sometimes be a single codepoint and may sometimes be multiple codepoints. The simplest way of treating Unicode forces all of its complexity onto every user of the language, regardless of which encoding is used.
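
A concrete instance of the character/codepoint mismatch, in Python:

```python
u_one = "\u00fc"    # 'ü' as a single precomposed codepoint
u_two = "u\u0308"   # 'u' followed by a combining diaeresis
# Same visible character, but different lengths and unequal as raw strings:
assert len(u_one) == 1
assert len(u_two) == 2
assert u_one != u_two
```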

You'd probably want to declare a standard Unicode normalization that gets performed on all byte streams when converting them into strings. (You may want an optional parameter that lets programmers override the default normalization, but that adds a lot of headaches.) What you or I probably think of as a single character can potentially be represented in more than one way in Unicode. Presumably, string comparison would work codepoint by codepoint for efficiency reasons. In order for string comparison not to be horribly confusing, strings should probably be normalized internally so that there's only one internal representation for each character.

For instance, there's a Unicode codepoint for the letter u, there's a codepoint for u with an umlaut, and there's a codepoint that adds an umlaut to an adjacent character; in this way, a u with an umlaut can be represented with one or two codepoints. Similarly, (as I remember) each Korean character can be represented as a single codepoint or as a sequence of 3 codepoints for its 3 phonemes (including a codepoint for a silent consonant). Han unification characters (Chinese logograms and their equivalents in written Japanese and Korean) can be represented as single codepoints, or as codepoints for simple graphical elements plus codepoints for composing those elements into characters.

There are several standards for normalizing Unicode, but most of them boil down to either always representing each character with as few codepoints as possible, or else always decomposing each character into codepoints for its most basic elements.
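
Python's unicodedata module exposes exactly these two families of normalization: NFC composes each character into as few codepoints as possible, NFD decomposes it into its most basic elements.

```python
import unicodedata

# NFC: compose into as few codepoints as possible.
assert unicodedata.normalize("NFC", "u\u0308") == "\u00fc"
# NFD: decompose into the most basic elements.
assert unicodedata.normalize("NFD", "\u00fc") == "u\u0308"
# After normalizing both sides, the two spellings compare equal:
assert (unicodedata.normalize("NFC", "u\u0308")
        == unicodedata.normalize("NFC", "\u00fc"))
```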

Perhaps UTF-8 strings could be represented internally as a byte array and an enum indicating the normalization used (plus an array length and a codepoint count). This would allow a fast code path for indexing, substring operations, pattern matching, and comparison of pure ASCII strings, as well as a fast code path for pattern matching and comparison of two strings that used the default normalization. Comparison between two strings that used non-default normalization (even if the two strings use the same normalization) would need to involve a renormalization step in order to prevent the case where a > b, b > c and c > a. (Different normalizations could potentially result in different orderings, so for a globally consistent ordering, the same normalization must be used in all comparisons.)

It may be wise for a string to internally keep track of normalization inconsistencies in the original input byte array, so that arrays of bytes can be converted into strings that can then be converted back to the original byte arrays. One would hope that most strings being imported into the system would be internally consistent in their normalization so that this "inconsistencies annotations" structure would be nil in the vast majority of cases.

Pattern matching, such as regexes, should also involve normalization so as to not trap the unwary programmer.

There are also corner cases for things like the Latin wide characters. Basically, CJK (Chinese, Japanese, and Korean) characters are about as wide as they are tall, but Latin characters are about half as wide as they are tall. When mixed with CJK characters, Latin characters look much better when they're stretched horizontally. Unicode has a set of codepoints for normal Latin characters and a different set of codepoints for the wide versions of the Latin characters. Should the default unicode normalization turn the wide Latin characters into normal Latin characters?
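
For what it's worth, the compatibility normalizations (NFKC/NFKD) do fold the wide Latin characters back to their ordinary forms, at the cost of losing the presentational distinction. In Python:

```python
import unicodedata

wide = "\uff25\uff36\uff21\uff2c"   # fullwidth 'EVAL'
assert wide != "EVAL"               # distinct codepoints from ASCII 'EVAL'
assert unicodedata.normalize("NFKC", wide) == "EVAL"
# NFC, by contrast, preserves the wide forms:
assert unicodedata.normalize("NFC", wide) == wide
```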

In summary, UTF-8 is absolutely brilliant in the way it treats the subtleties of being backward-compatible with ASCII. However, there's no simple way to deal with Unicode in a way that both preserves certain identities and also isn't a hazard for unwary programmers.

-----

1 point by papersmith 6138 days ago | link

Wow, thanks for the thorough reply. That explains a lot.

-----

2 points by kmag 6140 days ago | link

I should also point out that dealing with non-normalized Unicode has caused many security bugs in the past. For instance, sometimes input validation routines have bugs where they only test for the normalized forms of unsafe inputs.

Imagine a website that checks that user-submitted JavaScript fragments only contain a "safe" subset of JavaScript, disallowing eval(). Now imagine that the user input contains eval with the "e" represented as a Latin wide character and imagine that some browser understands Latin wide versions of JavaScript keywords, or that the code that renders web pages to HTML transforms Latin wide characters to standard Latin characters.
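
That failure mode can be sketched directly in Python (the filter here is deliberately naive):

```python
import unicodedata

def naive_filter_allows(code):
    # A buggy validator that only checks for the literal ASCII spelling.
    return "eval" not in code

payload = "\uff45val(untrusted)"     # fullwidth 'e' followed by "val"
assert naive_filter_allows(payload)  # the check is bypassed

# Normalizing before validation closes this particular hole:
normalized = unicodedata.normalize("NFKC", payload)
assert not naive_filter_allows(normalized)
```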

-----