Arc Forum
1 point by weeble 5902 days ago | link | parent

I think the point is that, in the presence of combining diacritics, even 32 bits per character isn't enough. A character is (roughly) one "base" 32-bit code plus zero or more "combining" 32-bit codes. And equality between two characters isn't purely structural: the combining codes may appear in a different order, or the whole character may be written as a single pre-combined code. (Not all combinations have pre-combined codes.)
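For instance (a minimal Python sketch, since Python exposes Unicode normalization directly; none of this is Arc-specific):

    import unicodedata

    # "é" can be one precomposed code point, or a base letter plus a
    # combining acute accent; the two are canonically equivalent but
    # not structurally equal.
    precomposed = "\u00e9"    # é as a single code point
    decomposed  = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

    print(precomposed == decomposed)   # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # Reordered combining marks are also canonically equivalent:
    s1 = "e\u0301\u0316"   # acute above, then grave accent below
    s2 = "e\u0316\u0301"   # grave accent below, then acute above
    print(unicodedata.normalize("NFD", s1) ==
          unicodedata.normalize("NFD", s2))   # True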

I will point out that I know very little about Unicode, so I might be a bit off. I can't say that I'm even very interested in the whole Unicode debate, so long as it all gets sorted out at some point in the future.



1 point by tree 5902 days ago | link

The only reason Unicode contains precomposed forms is compatibility with existing standards: no new code points will be added for novel combinations of base and combining characters. The Unicode normalization forms (NFC, NFD, and so on) deal with these issues.
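For example (again a Python sketch, purely illustrative), a combination with no precomposed code point stays decomposed even under NFC:

    import unicodedata

    # 'q' with a combining circumflex has no precomposed code point,
    # so NFC leaves it as a two-code-point sequence.
    s = "q\u0302"   # 'q' + COMBINING CIRCUMFLEX ACCENT
    nfc = unicodedata.normalize("NFC", s)
    print([hex(ord(c)) for c in nfc])   # ['0x71', '0x302']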

Unicode support is a complex issue: fundamentally there is the question of low-level character representation (the in-memory encoding), followed by library support for normalization and higher-level text processing operations.
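To make the two layers concrete (Python again, illustrative only): the same five code points occupy different numbers of bytes depending on the internal encoding, and that representation choice is independent of the normalization and collation libraries layered on top:

    s = "na\u00efve"                   # "naïve": 5 code points
    print(len(s))                      # 5
    print(len(s.encode("utf-8")))      # 6 bytes (ï takes two)
    print(len(s.encode("utf-16-le")))  # 10 bytes
    print(len(s.encode("utf-32-le")))  # 20 bytes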

-----

1 point by olavk 5902 days ago | link

True, I should have said Unicode code points rather than characters. I believe the fundamental point is that strings should always be sequences of Unicode code points and shouldn't be conflated with byte arrays. The thorny issues of normalization, comparison, sorting, rendering combined characters, and so on could be handled by libraries at a later stage.
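A quick sketch of why the distinction matters (Python, illustrative only): operating on the underlying bytes instead of the code points breaks even simple operations:

    s = "h\u00e9llo"                 # "héllo"
    b = s.encode("utf-8")

    # Reversing the byte array splits é's two-byte UTF-8 sequence:
    print(b[::-1].decode("utf-8", errors="replace"))   # garbled output
    # Reversing the code point sequence behaves as expected:
    print(s[::-1])                                     # "olléh"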

-----