Arc Forum
1 point by are 6096 days ago | link | parent

I'm no expert, but with Unicode strings, doesn't the "character" abstraction break down anyway, when a "character" sometimes needs to have multiple glyphs? With "characters" removed, wouldn't a multi-glyph "character" just be a length-1+ string as well? Or am I just talking out of my arse here?


4 points by sacado 6096 days ago | link

If I'm right, there are many concepts behind Unicode:

A Unicode string is a sequence of codepoints (or characters).

A codepoint (or character) is a numeric id that can be represented in many ways (UTF-32: always 4 bytes; UTF-8: only one byte for codepoints < 128, and 2 to 4 bytes for codepoints >= 128; ...).

A glyph is what you display on the screen: a Unicode string is displayed as a sequence of glyphs. A glyph can be a single character, or the combination of two or more characters.

e.g., 10 glyphs on the screen can be represented as 11 characters (the last two being composed into a single glyph). Depending on the underlying "physical" encoding, these 11 characters can occupy 44 bytes (UTF-32, with O(1) access to substrings) or, say, 25 bytes (UTF-8, with O(n) access to substrings), or ...
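To make this concrete, here is a rough sketch in Python (not Arc, just for illustration), using "e" followed by a combining acute accent as a glyph that is two characters long:

    # "cafe" followed by U+0301 (combining acute accent):
    # 5 code points, but displayed as 4 glyphs ("café")
    s = "cafe\u0301"
    print(len(s))                      # 5  code points ("characters")
    print(len(s.encode("utf-8")))      # 6  bytes in UTF-8 (the accent takes 2)
    print(len(s.encode("utf-32-be")))  # 20 bytes in UTF-32 (4 per code point)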

In a few words: characters have a meaning in Unicode, but they don't map cleanly to bytes (or even to the physical representation) and, sometimes, not even to the way things are displayed on the screen.

Correct me if I'm wrong.

-----

2 points by almkglor 6096 days ago | link

As far as I know, a "glyph" has a one-to-one mapping to a character, where "glyph" means the on-screen symbol used to represent the character (I'm not sure whether there exist multi-glyph single characters, although I do think there are characters which, when they appear in certain sequences, end up being displayed as one glyph, even though they are logically separate characters).

Or do you really mean "octet" or byte, several of which are regularly used to represent a single character during a Unicode transmission? In that case... define "string". Is a "string" a sequence of bytes, or a sequence of characters?

-----

3 points by olavk 6093 days ago | link

I believe that accented characters like é, for example, are implemented as a single glyph in fonts, but are composed of two Unicode code points: the base character (e) and a combining modifier character (´).

This is complicated by the fact that Unicode also supports the combined character as a separate single code point, for backwards compatibility with legacy character sets. However, the decomposed (normalized) form is the recommended one.
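A small sketch of this in Python (not Arc; unicodedata is Python's standard normalization module), showing both forms and how normalization converts between them:

    import unicodedata

    precomposed = "\u00e9"    # é as a single code point (U+00E9), kept for legacy compatibility
    decomposed  = "e\u0301"   # e followed by a combining acute accent

    print(precomposed == decomposed)                                 # False: different code points
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True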

-----

1 point by almkglor 6093 days ago | link

True. A bit of research also suggests that it would be better for both forms to be considered "equal" when comparing individual characters.
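Something like this is what I have in mind (a Python sketch; chars_equal is just a hypothetical helper name), normalizing both sides before comparing:

    import unicodedata

    def chars_equal(a, b):
        # Treat precomposed and decomposed forms as equal by normalizing
        # both sides to NFC before comparing (NFD would work just as well).
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print(chars_equal("\u00e9", "e\u0301"))   # True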

-----

1 point by are 6093 days ago | link

> Or do you really mean "octet" or byte, several of which are regularly used to represent a single character during a Unicode transmission? In that case... define "string". Is a "string" a sequence of bytes, or a sequence of characters?

I know I shouldn't have dipped my ignorant toe into Unicode waters :-)

Maybe a better question would be: If Arc got rid of the character datatype by collapsing strings-and-characters into strings-and-substrings, could you leave "how to represent a string" (chars vs. octets vs. bytes vs. code points vs. glyphs) out of the language spec altogether? Or would such a "clean" string abstraction conflict with having Unicode support (since Unicode is deeply encoding-specific)?
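Roughly what I'm picturing (sketched in Python only because it already works this way at the code-point level; whether Arc could do it without fixing a representation is exactly my question):

    s = "abc"
    c = s[0]          # indexing yields a length-1 string, not a separate char type
    print(type(c))    # <class 'str'>
    print(c + "de")   # substrings compose like any other string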

-----