Arc Forum

2 points by almkglor 6318 days ago | link | parent

As far as I know, a "glyph" has a one-to-one mapping to a character, where "glyph" means the on-screen symbol used to represent the character. I'm not sure whether any single character is drawn with multiple glyphs, although I do think there are characters which, when they appear in some sequence, end up being displayed as one glyph even though they are logically separate characters.

Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?
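To make the bytes-versus-characters distinction concrete, here is a minimal Python sketch. It assumes UTF-8 as the transmission encoding, which is only one of several possibilities; the variable names are just for illustration.

```python
# One character (a single code point), written with an escape so the
# source file's own encoding cannot change what is being measured.
s = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# The same text as a sequence of octets, assuming UTF-8 on the wire.
encoded = s.encode("utf-8")

char_count = len(s)         # counts code points
byte_count = len(encoded)   # counts octets

print(char_count, byte_count)
```

Under that assumption the one-character string occupies two octets, so "length of a string" gives different answers depending on which of the two definitions the language picks.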

3 points by olavk 6315 days ago | link

I believe that accented characters such as é are implemented as a single glyph in fonts, but can be composed of two Unicode code points: the base character (e) and a combining accent (´).

This is complicated by the fact that Unicode also supports the combined character as a separate single code point, for backwards compatibility with legacy character sets. However, the decomposed (normalized) form is the recommended one.
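A small Python sketch of the two spellings of é, using the standard `unicodedata` module. The variable names are illustrative only; NFD and NFC are the Unicode names for the canonical decomposed and composed normalization forms.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same text as far as a reader is concerned, but different lengths.
lengths = (len(composed), len(decomposed))

# Normalization converts between the two spellings.
to_nfd = unicodedata.normalize("NFD", composed)
to_nfc = unicodedata.normalize("NFC", decomposed)

# The official names of the code points in the decomposed spelling.
names = [unicodedata.name(c) for c in decomposed]

print(lengths, names)
```

Both spellings would normally render as the same glyph, which is why the code point count alone says little about what ends up on screen.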

-----

1 point by almkglor 6314 days ago | link

True. A bit of research also suggests that it would be better for both forms to be considered "equal" when comparing individual characters.
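One way such an "equal" comparison could look, sketched in Python. `canonically_equal` is a hypothetical helper, not anything Arc provides, and normalizing both sides to NFD before comparing is just one possible choice.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e plus a combining acute accent

raw_equal = composed == decomposed                       # code point by code point
normalized_equal = canonically_equal(composed, decomposed)
different_letters = canonically_equal("e", composed)     # plain e vs é

print(raw_equal, normalized_equal, different_letters)
```

A plain code point comparison treats the two spellings as different, while the normalizing comparison treats them as the same character and still distinguishes e from é.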

-----

1 point by are 6314 days ago | link

> Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?

I know I shouldn't have dipped my ignorant toe into Unicode waters :-)

Maybe a better question would be: If Arc got rid of the character datatype by collapsing strings-and-characters into strings-and-substrings, could you leave "how to represent a string" (chars vs. octets vs. bytes vs. code points vs. glyphs) out of the language spec altogether? Or would such a "clean" string abstraction conflict with having Unicode support (since Unicode is deeply encoding-specific)?

-----