"This is a bit of a spec wormhole (as akkartik calls it ^_^ )"
I have no clue what you're talking about.
---
"If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?"
I don't see why not...
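The nice thing about a parenthesized form is that it names the code point directly, with no surrogate-pair gymnastics. A minimal sketch in Python of what a parser could do (the `\u(...)` syntax is the one under discussion, but the `unescape` helper and its exact rules are my own, hypothetical):

```python
import re

def unescape(s):
    """Expand hypothetical \\u(XXXXX) escapes into the code points they name."""
    return re.sub(
        r"\\u\(([0-9A-Fa-f]{1,6})\)",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )

print(unescape("\\u(41)"))     # A
print(unescape("\\u(12345)"))  # the single code point U+12345
```

Because the hex digits are delimited by parentheses, code points above FFFF need no special casing at all.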
---
"Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode?"
Invalid Unicode is not allowed.
---
"If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode."
I have no clue where you got that idea from... I'm assuming you mean that JSON is encoded in UTF-16.
UTF-16 is just a particular encoding of Unicode that happens to use two or four 8-bit bytes, that's all. UTF-16 can currently handle all valid Unicode and doesn't allow for invalid Unicode.
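The surrogate-pair mechanics are easy to check in Python 3, for example, where a str is a sequence of Unicode code points:

```python
ch = "\U0001F600"  # a code point above U+FFFF

# In UTF-16 it becomes a surrogate pair: two 16-bit units, four bytes.
encoded = ch.encode("utf-16-be")
print(len(encoded))    # 4
print(encoded.hex())   # d83dde00

# A lone surrogate is invalid Unicode, so it can't be encoded at all.
try:
    "\ud800".encode("utf-16-be")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```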
But JSON doesn't even use UTF-16. Just like Nuit, JSON uses "sequences of Unicode characters" for its strings. And also like Nuit, JSON doesn't specify the encoding: neither "json.org" nor Wikipedia makes any mention of encoding. And the JSON RFC (https://tools.ietf.org/html/rfc4627) says:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Similarly, the YAML spec says:
All characters mentioned in this specification are Unicode code points. Each
such code point is written as one or more bytes depending on the character
encoding used. Note that in UTF-16, characters above #xFFFF are written as
four bytes, using a surrogate pair.
The character encoding is a presentation detail and must not be used to
convey content information.
On input, a YAML processor must support the UTF-8 and UTF-16 character
encodings. For JSON compatibility, the UTF-32 encodings must also be
supported.
And the ECMAScript 5 spec says:
A conforming implementation of this Standard shall interpret characters in
conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
implementation level 3.
And I believe Java also uses UTF-16.
But I see no reason to limit Nuit to only certain encodings. And if I did decide to specify a One True Encoding To Rule Them All, I'd specify UTF-8 because it's the overall best Unicode encoding that we have right now.
Instead, if a Nuit parser/serializer is used on a string that it can't decode/encode, it just throws an error. It's highly recommended to support at least UTF-8, but any Unicode encoding will do.
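In Python terms that behavior falls out of the codec layer for free; a sketch (the function name and the choice to wrap the error are mine, just one reasonable way to do it):

```python
def read_nuit_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the raw bytes of a Nuit document, raising on invalid input."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        raise ValueError(f"document is not valid {encoding}: {e}") from e

print(read_nuit_text("@foo".encode("utf-8")))  # @foo
# read_nuit_text(b"\xff\xfe\x00") would raise ValueError
```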
---
"Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)"
The current spec for Nuit says to throw an error for byte order marks appearing in the middle of the file.
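One way to implement that check, sketched in Python (the helper name is hypothetical, and I'm assuming a single leading BOM is tolerated as an encoding signature; the spec only says an interior BOM is an error):

```python
BOM = "\ufeff"

def strip_leading_bom(text: str) -> str:
    """Accept one leading BOM, but reject a BOM anywhere else."""
    if text.startswith(BOM):
        text = text[1:]
    if BOM in text:
        raise ValueError("byte order mark in the middle of the document")
    return text

print(strip_leading_bom("\ufeff@foo"))  # @foo
# strip_leading_bom("@foo\ufeff@bar") raises ValueError
```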
---
"Zero? How do you encode a zero-length list in Nuit?"
Easy, just use a plain @ with nothing after it:
@
@foo
The above are equivalent to the JSON [] and ["foo"], respectively.
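The rule generalizes cleanly. A toy parser for a single flat @-line (my own sketch, ignoring nesting, sigils, and escapes) shows the empty case falling out naturally:

```python
def parse_flat_list(line: str) -> list:
    """Parse one '@' line into a list of whitespace-separated strings."""
    assert line.startswith("@")
    return line[1:].split()

print(parse_flat_list("@"))         # []
print(parse_flat_list("@foo"))      # ['foo']
print(parse_flat_list("@bar qux"))  # ['bar', 'qux']
```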
---
"Is there a way to encode a list whose first element is a list?"
Yes:
@
  @bar qux
And I was thinking about changing the spec so that everything after the first string is treated as a sigil rather than as a string, which would allow a more compact form.
---
Ah, I get to learn a few new things about JSON! JSON strings are limited to valid Unicode characters, and "A JSON text is a serialized object or array," not a number, a boolean, or null. All this time I thought these were just common misconceptions! XD
It turns out my own misconceptions about JSON are based on ECMAScript 5.
To start, ECMAScript 5 is very specific about the fact that ECMAScript strings are arbitrary sequences of unsigned 16-bit values.
4.3.16
String value
primitive value that is a finite ordered sequence of zero or more
16-bit unsigned integer
NOTE A String value is a member of the String type. Each integer
value in the sequence usually represents a single 16-bit unit of
UTF-16 text. However, ECMAScript does not place any restrictions or
requirements on the values except that they must be 16-bit unsigned
integers.
ECMAScript 5's specification of JSON.parse and JSON.stringify explicitly calls out the JSON spec, but then it relaxes the constraint that the top-level value must be an object or array, and it subtly (maybe too subtly) relaxes the constraint that strings must contain valid Unicode. It says "The JSON interchange format used in this specification is exactly that described by RFC 4627 with two exceptions," and one of those exceptions is that conforming implementations of ECMAScript 5 aren't allowed to implement their own extensions to the JSON format, and must instead use exactly the format defined by ECMAScript 5. As it happens, the formal JSON grammar defined by ECMAScript 5 supports invalid Unicode.
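Python's json module follows the same grammar here, so the point is easy to demonstrate: a \uD834 escape with no trailing low surrogate parses without complaint, even though the resulting string isn't valid Unicode (assuming Python 3):

```python
import json

s = json.loads('"\\ud834"')  # a lone high surrogate: the grammar allows it
print(len(s))  # 1

# ...but the result can't be encoded as real Unicode text.
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("not valid Unicode")
```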
---
"This follows naturally if you assume that empty strings aren't included in the list."
I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different.
---
"I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different."
Then I'll change the spec to be more understandable. What wording would you prefer?
I'll try to make it a minimal change: "If there's anything between the @ and the first whitespace character, that intervening string is the first element of the list."
---
Well, it's true that the Nuit spec intentionally ignores encoding issues, and thus a Nuit parser/serializer might need to understand encoding in addition to the Nuit spec. I don't see a problem with that.
The Arc implementation of Nuit basically just ignores encoding issues because Racket already takes care of all that. So any encoding information in the Nuit spec would have just been a nuisance.
There's already plenty of information out there about different Unicode encodings, so people can just use that if they don't have the luxury of relying on a system like Racket.
---
I see encoding as having to do with the storage and transportation of text, which is certainly important, but it's beyond the scope of Nuit.
Perhaps a Nuit serializer wants to use ASCII because the system it's communicating with doesn't support Unicode. It could then use Punycode encoding.
Or perhaps the Nuit document contains lots of Asian symbols (Japanese, Chinese, etc.) and so the serializer wants to use an encoding that is better (faster or smaller) for those languages.
Or perhaps it's transmitting over HTTP in which case it must use CR+LF line endings and will probably want to use UTF-8.
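The size difference for Asian text is real: for BMP CJK characters, UTF-8 spends three bytes per character where UTF-16 spends two. A quick check in Python:

```python
text = "日本語のテキスト"  # eight BMP characters (kanji and kana)

print(len(text.encode("utf-8")))     # 24 bytes (3 per character)
print(len(text.encode("utf-16-le"))) # 16 bytes (2 per character)
```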
---
I'll note that Nuit also doesn't specify much about line endings. It says that the parser must convert line endings to U+000A but it doesn't say what to do when serializing.
If serializing to a file on Windows, the serializer probably wants to use CR+LF. If on Linux it would want to use LF. If transmitting over HTTP it must use CR+LF, etc.
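A serializer can simply treat the newline style as a parameter. A sketch of both directions in Python (the function names are my own, and I'm assuming bare CR also gets folded on input, which the spec doesn't spell out):

```python
def normalize_newlines(text: str) -> str:
    """On parse: fold CR+LF and bare CR down to U+000A."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def serialize_newlines(text: str, newline: str = "\n") -> str:
    """On output: emit whatever the target wants (e.g. CR+LF for HTTP)."""
    return normalize_newlines(text).replace("\n", newline)

print(repr(normalize_newlines("@a\r\n@b\r@c")))   # '@a\n@b\n@c'
print(repr(serialize_newlines("@a\n@b", "\r\n"))) # '@a\r\n@b'
```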
Nuit also doesn't specify endianness, or whether a list should map to an array or a vector, or how big a byte should be, or whether the computer system is digital/analog/quantum, or or or...
Nuit shouldn't even be worrying about such things. Nuit shouldn't have to specify every tiny minuscule detail of how to accomplish things.
They are implementation details, which should be handled by the parsers/serializers on a case-by-case basis, in the way that seems best to them.