To be precise, apparently the result of parsing a Nuit value is either of the following:
- A Nuit string, which is a finite sequence of 16-bit values (just like a JavaScript or JSON string).
- A pair consisting of a Nuit string and a finite sequence of Nuit values. This recursion can't create cycles.
When translating Nuit to JSON, Nuit strings become JSON strings, and string-sequence pairs become JSON Arrays where the first element is a string.
Is this right?
---
Raw text appearing in Nuit's surface syntax (which starts as UTF-8, as you specify) becomes encoded as a sequence of UTF-16 code units, right? You just use the word "characters" as if it's obvious, but if you want strings to be sequences of full Unicode code points, your 16-bit escape sequences aren't sufficient.
Does the byte-order mark have any effect on the indentation of the first line?
What if the first line is indented but it occurs after a shebang line? Does the # consume it?
If I understand correctly, every Nuit comment must take up at least one whole line. There's no end-of-line comment. Is this intentional?
---
When I use JSON, I often encode richer kinds of data in the form {"type":"mydatatype",...} or rarely ["mydatatype",...]. Here's a stab at encoding richer data (in this case JSON!) inside Nuit:
[{a:1,b:null},"null"]
-->
@array
@obj
a
@number 1
b
@null
@string null
I don't have an opinion about this yet, but it's something to contemplate.
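For anyone who wants to play with the idea, here is a rough Python sketch of that kind of meta-encoding. It is only one possible reading of the example above: the function name encode_json_as_nuit is made up, the "boolean" tag is an assumption (the example only shows array, obj, number, null and string), and Nuit values are modelled as plain Python strings and lists.

```python
import json

def encode_json_as_nuit(value):
    """Encode a parsed JSON value as nested lists of strings (hypothetical scheme)."""
    if value is None:
        return ["null"]
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return ["boolean", "true" if value else "false"]  # assumed tag name
    if isinstance(value, (int, float)):
        return ["number", json.dumps(value)]
    if isinstance(value, str):
        return ["string", value]
    if isinstance(value, list):
        return ["array"] + [encode_json_as_nuit(item) for item in value]
    if isinstance(value, dict):
        encoded = ["obj"]
        for key, item in value.items():
            encoded.append(key)  # JSON keys are always strings
            encoded.append(encode_json_as_nuit(item))
        return encoded
    raise TypeError("not a JSON value: %r" % (value,))

print(encode_json_as_nuit([{"a": 1, "b": None}, "null"]))
```

The tag in the first position is what lets the string "null" and the value null stay distinct after encoding.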
"A Nuit string, which is a finite sequence of 16-bit values (just like a JavaScript or JSON string)."
A finite sequence of Unicode characters. UTF-8 is recommended, but the encoding can be any Unicode encoding (UTF-32/16/8/7, Punycode, etc.)
---
"A pair consisting of a Nuit string and a finite sequence of Nuit values. This recursion can't create cycles."
No, because it uses the abstract concept of "list", which might map to a vector, array, cons, binary tree, etc. The only requirement is that it can hold 0 or more strings in order. How it's represented in a particular programming language is an implementation detail, not part of the specification.
---
"When translating Nuit to JSON, Nuit strings become JSON strings, and string-sequence pairs become JSON Arrays where the first element is a string."
Yes, except an empty Nuit list would be an empty JSON array. Also, if a meta-encoding scheme were used, it is possible for the serializer to encode Nuit as a JSON object, number, etc. But those are just de facto conventions, not part of the spec.
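A minimal Python sketch of the plain translation described here, assuming Nuit values are modelled as Python strings and lists (that mapping is an illustrative choice, not something the spec requires, and the function names are made up):

```python
import json

def check_nuit_value(value):
    """Raise TypeError unless value is a string or a (possibly nested) list of strings."""
    if isinstance(value, str):
        return
    if isinstance(value, list):
        for item in value:
            check_nuit_value(item)
        return
    raise TypeError("not a Nuit value: %r" % (value,))

def nuit_to_json(value):
    """Translate a Nuit value to JSON text: strings stay strings, lists become arrays."""
    check_nuit_value(value)
    return json.dumps(value, ensure_ascii=False)

print(nuit_to_json([]))                      # an empty list becomes an empty array
print(nuit_to_json(["foo", ["bar", "qux"]])) # nested lists become nested arrays
```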
---
"Raw text appearing in Nuit's surface syntax (which starts as UTF-8, as you specify)"
Actually, the spec doesn't mention any encoding at all. It deals only with Unicode characters, with the encoding being an implementation detail. Parsers/serializers can use any encoding they want, as long as it supports Unicode. Even Punycode could be used.
In the "Size comparison" section I mention that it is assumed that UTF-8 is used in serialization. That was just so that the bytes would be consistent between the different examples.
It's also useful because it mimics a common situation found when transmitting data over HTTP, so it's closer to a "real world" benchmark rather than a synthetic one. That's also why CR+LF line endings were used rather than just LF.
As noted at the very bottom, if LF or CR endings are used, then Nuit becomes even shorter. This means that even in the worst-case scenario of CR+LF, Nuit is still shorter than JSON.
---
"You just use the word "characters" as if it's obvious, but if you want strings to be sequences of full Unicode code points, your 16-bit escape sequences aren't sufficient."
Incorrect. UTF-7/8 and UTF-16 are capable of representing all Unicode code points. UTF-7/8 does so by using a variable number of bytes. UTF-16 does so by using surrogate pairs. Punycode does so by using dark voodoo magic.
All that matters is that a string is a finite sequence of Unicode code points. How those code points are encoded is an implementation detail.
Hmmm... I think the current spec actually forbids certain valid UTF-16 strings, because surrogate pairs are forbidden. So I should change the Unicode part of the spec so it works correctly in all Unicode encodings.
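To make the surrogate pair point concrete, here is a small Python sketch of how a code point above U+FFFF turns into two 16-bit units in UTF-16. The helper name utf16_units is made up for illustration; the arithmetic is the standard UTF-16 algorithm.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the low surrogate
    return [high, low]

units = utf16_units(0x12345)
print([hex(unit) for unit in units])

# Cross-check against Python's own encoder.
encoded = chr(0x12345).encode("utf-16-be")
print(encoded.hex())
```

The same code point takes four bytes in UTF-8, so both encodings can carry it; they just split it up differently.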
---
"Does the byte-order mark have any effect on the indentation of the first line?"
Nope. It's a part of the encoding and thus is an implementation detail, so it has no effect on indentation.
---
"What if the first line is indented but it occurs after a shebang line? Does the # consume it?"
Yes, the # would consume it. If you don't want that, then the first line must not be indented. The same is true of @ and ` and ". This is intentional. In fact, it's actually illegal for the first sigil to be indented. This is to help avoid the kind of mistakes that you're talking about.
---
"If I understand correctly, every Nuit comment must take up at least one whole line. There's no end-of-line comment. Is this intentional?"
That is correct and it is intentional. The design of Nuit only allows sigils at the start of a line. This makes it easy to take almost any arbitrary string and plop it in without having to quote or escape it. Which means that this Nuit code:
@foo bar
qux#nou
Would be equivalent to this JSON:
["foo", "bar", "qux#nou"]
That's part of the secret to not needing delimiters and escapes. The other part of the secret is using indentation, like with `
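A tiny Python sketch of the "sigils only at the start of a line" rule, to show why qux#nou stays an ordinary string. The function name and the choice to skip only leading spaces are assumptions; the sigil set is just the four characters named in this thread (@ # ` "), not a full parser.

```python
SIGILS = "@#`\""

def first_sigil(line):
    """Return the sigil that starts the line (after indentation), or None."""
    stripped = line.lstrip(" ")
    if stripped and stripped[0] in SIGILS:
        return stripped[0]
    return None

print(first_sigil("@foo bar"))  # a list line
print(first_sigil("#nou"))      # a comment line
print(first_sigil("qux#nou"))   # no sigil: the # is just part of the string
```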
---
"I don't have an opinion about this yet, but it's something to contemplate."
I have already thought about such "meta-encoding schemes." Nuit itself doesn't do anything special with them, but applications can use the information to do something special. It is up to the applications to parse things in the way they want to.
I'm not against a Nuit parser/serializer using those kinds of de facto encoding schemes, but I want to keep Nuit simple, so I don't plan to put them into the spec. But, I might include some standard meta-encodings for JSON and YAML. They would be built on top of the simpler Nuit which supports only lists and strings.
This works because JSON keys must always be strings.
---
"Apparently I have to use \u(20) in order to put a space at the end of a string."
Yes, except that \ is only valid at the start of a line or within a " so you would have to prefix those lines with ":
@tag a
@attr href http://www.arclanguage.org/
"Visit\u(20)
@tag cite
Arc Forum
!
I'm still thinking about the right interaction of whitespace, ", and \ escaping. But I believe making whitespace at the end of the line illegal is an overall net gain. I might change my mind about it later.
This is a bit of a spec wormhole (as akkartik calls it ^_^ ), but go with it if you feel it's right.
If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?
Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode? If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode.
Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)
---
"The only requirement is that it can hold 0 or more strings in order."
Technically it needs to hold sub-lists too, but I know that's not your point.
Zero? How do you encode a zero-length list in Nuit?
Is there a way to encode a list whose first element is a list?
Oh, come to think of it, is there a way to encode a list whose first element is a string with whitespace inside?
---
"In fact, it's actually illegal for the first sigil to be indented."
Cool. Put it in the doc. ^_^
I assume you mean the first line, rather than the first sigil. The first line could be a sigil-free string, right?
Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?
---
"I'm not against a Nuit parser/serializer using those kinds of de facto encoding schemes, but I want to keep Nuit simple, so I don't plan to put them into the spec."
I like it that way too.
---
@attr href http://www.arclanguage.org/
Er, I think that creates the following:
[ "attr", "href http://www.arclanguage.org/" ]
---
"But I believe making whitespace at the end of the line illegal is an overall net gain."
I agree. I'm not sure I'd make it illegal, but I'd at least ignore it.
If you make whitespace at the end of blank lines illegal, bah! I like to indent my blank lines. :-p
This is partially because I've used editors which do it for me, but also because I code with whitespace visible, and a completely blank line looks like a hard boundary between completely separate blocks of code.
"This is a bit of a spec wormhole (as akkartik calls it ^_^ )"
I have no clue what you're talking about.
---
"If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?"
I don't see why not...
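A rough Python sketch of how a parser might decode such escapes. The name decode_escapes and the regex are illustrative assumptions: the thread only shows \u(20) and \u(12345), so accepting 1 to 6 hex digits is a guess, and surrogate values are rejected here on the assumption that invalid Unicode is not allowed.

```python
import re

_ESCAPE = re.compile(r"\\u\(([0-9A-Fa-f]{1,6})\)")

def decode_escapes(text):
    """Replace every \\u(HEX) escape in text with the code point it names."""
    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"U+{code_point:04X} is not a valid Unicode scalar value")
        return chr(code_point)
    return _ESCAPE.sub(replace, text)

print(repr(decode_escapes(r"Visit\u(20)")))
print(len(decode_escapes(r"\u(12345)")))
```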
---
"Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode?"
Invalid Unicode is not allowed.
---
"If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode."
I have no clue where you got that idea from... I'm assuming you mean that JSON is encoded in UTF-16.
UTF-16 is just a particular encoding of Unicode that happens to use two or four 8-bit bytes, that's all. UTF-16 can currently handle all valid Unicode and doesn't allow for invalid Unicode.
But JSON doesn't even use UTF-16. Just like Nuit, JSON uses "sequences of Unicode characters" for its strings. And also like Nuit, JSON doesn't specify the encoding: neither "json.org" nor Wikipedia makes any mention of encoding. And the JSON RFC (https://tools.ietf.org/html/rfc4627) says:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
All characters mentioned in this specification are Unicode code points. Each
such code point is written as one or more bytes depending on the character
encoding used. Note that in UTF-16, characters above #xFFFF are written as
four bytes, using a surrogate pair.
The character encoding is a presentation detail and must not be used to
convey content information.
On input, a YAML processor must support the UTF-8 and UTF-16 character
encodings. For JSON compatibility, the UTF-32 encodings must also be
supported.
A conforming implementation of this Standard shall interpret characters in
conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
implementation level 3.
And I believe Java also uses UTF-16.
But I see no reason to limit Nuit to only certain encodings. And if I did decide to specify a One True Encoding To Rule Them All, I'd specify UTF-8 because it's the overall best Unicode encoding that we have right now.
Instead, if a Nuit parser/serializer is used on a string that it can't decode/encode, it just throws an error. It's very highly recommended to support at least UTF-8, but any Unicode encoding will do.
---
"Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)"
The current spec for Nuit says to throw an error for byte order marks appearing in the middle of the file.
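In Python that rule could be sketched like this, working on already-decoded text. The function name strip_bom is made up, and treating a single leading U+FEFF as part of the encoding (so it never counts as indentation) is how I read the answers in this thread.

```python
BOM = "\ufeff"

def strip_bom(text):
    """Drop one leading byte order mark; reject any that appear later."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    if BOM in text:
        raise ValueError("byte order mark in the middle of the text")
    return text

print(strip_bom("\ufeff@foo bar"))
```

A file made by concatenating two BOM-prefixed files would hit the ValueError branch instead of being silently accepted.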
---
"Zero? How do you encode a zero-length list in Nuit?"
Easy, just use a plain @ with nothing after it:
@
@foo
The above is equivalent to the JSON [] and ["foo"]
---
"Is there a way to encode a list whose first element is a list?"
Yes:
@
@bar qux
And I was thinking about changing the spec so that everything after the first string is treated as a sigil rather than as a string. Then you could say this:
Ah, I get to learn a few new things about JSON! JSON strings are limited to valid Unicode characters, and "A JSON text is a serialized object or array," not a number, a boolean, or null. All this time I thought these were just common misconceptions! XD
It turns out my own misconceptions about JSON are based on ECMAScript 5.
To start, ECMAScript 5 is very specific about the fact that ECMAScript strings are arbitrary sequences of unsigned 16-bit values.
4.3.16
String value
primitive value that is a finite ordered sequence of zero or more
16-bit unsigned integer
NOTE A String value is a member of the String type. Each integer
value in the sequence usually represents a single 16-bit unit of
UTF-16 text. However, ECMAScript does not place any restrictions or
requirements on the values except that they must be 16-bit unsigned
integers.
ECMAScript 5's specification of JSON.parse and JSON.stringify explicitly calls out the JSON spec, but then it relaxes the constraint that the top level of the value must be an object or array, and it subtly (maybe too subtly) relaxes the constraint that the strings must contain valid Unicode: It says "The JSON interchange format used in this specification is exactly that described by RFC 4627 with two exceptions," and one of those exceptions is that conforming implementations of ECMAScript 5 aren't allowed to implement their own extensions to the JSON format, and must instead use exactly the format defined by ECMAScript 5. As it happens, the formal JSON grammar defined by ECMAScript 5 supports invalid Unicode.
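The same kind of thing can be seen outside ECMAScript. Here is a small sketch using Python's json module, which (like the ECMAScript 5 grammar) accepts an escaped lone surrogate; this shows Python's behaviour only, as an analogy, and the helper name is_valid_unicode is made up.

```python
import json

def is_valid_unicode(text):
    """True if text can be encoded as UTF-8, i.e. it contains no lone surrogates."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

lone = json.loads('"\\ud800"')         # parses, but the result is a lone surrogate
pair = json.loads('"\\ud808\\udf45"')  # a proper surrogate pair becomes one code point

print(is_valid_unicode(lone))
print(is_valid_unicode(pair), len(pair))
```

A format that forbids invalid Unicode has no direct way to carry the first string, which is the difficulty being described.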
---
"This follows naturally if you assume that empty strings aren't included in the list."
I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different.
"I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different."
Then I'll change the spec to be more understandable. What wording would you prefer?
I'll try to make it a minimal change: "If there's anything between the @ and the first whitespace character, that intervening string is the first element of the list."
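Read that way, a single @ line could be handled roughly like this in Python. This is a sketch under assumptions: the name parse_at_line is made up, only a plain space counts as whitespace, child lines and indentation are ignored, and the rest of the line is kept as one string (the reading given earlier for the @attr example).

```python
def parse_at_line(line):
    """Parse one unindented @ line into a list of strings."""
    if not line.startswith("@"):
        raise ValueError("expected a line starting with @")
    head, _, rest = line[1:].partition(" ")
    items = []
    if head:   # anything between the @ and the first whitespace character
        items.append(head)
    if rest:   # whatever follows is kept as a single string
        items.append(rest)
    return items

print(parse_at_line("@"))
print(parse_at_line("@foo"))
print(parse_at_line("@attr href http://www.arclanguage.org/"))
```

With this wording a bare @ needs no special case: there is nothing between the @ and the end of the line, so the list is simply empty.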
Well, it's true that the Nuit spec intentionally ignores encoding issues, and thus a Nuit parser/serializer might need to understand encoding in addition to the Nuit spec. I don't see a problem with that.
The Arc implementation of Nuit basically just ignores encoding issues because Racket already takes care of all that. So any encoding information in the Nuit spec would have just been a nuisance.
There's already plenty of information out there about different Unicode encodings, so people can just use that if they don't have the luxury of relying on a system like Racket.
---
I see encoding as having to do with the storage and transportation of text, which is certainly important, but it's beyond the scope of Nuit.
Perhaps a Nuit serializer wants to use ASCII because the system it's communicating with doesn't support Unicode. It could then use Punycode encoding.
Or perhaps the Nuit document contains lots of Asian symbols (Japanese, Chinese, etc.) and so the serializer wants to use an encoding that is better (faster or smaller) for those languages.
Or perhaps it's transmitting over HTTP in which case it must use CR+LF line endings and will probably want to use UTF-8.
---
I'll note that Nuit also doesn't specify much about line endings. It says that the parser must convert line endings to U+000A but it doesn't say what to do when serializing.
If serializing to a file on Windows, the serializer probably wants to use CR+LF. If on Linux it would want to use LF. If transmitting over HTTP it must use CR+LF, etc.
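As a sketch of that split in Python: the parser side normalizes every line ending to U+000A, and the serializer side takes the line ending as a parameter. Both function names are made up for illustration.

```python
def normalize_newlines(text):
    """Parser side: convert CR+LF and lone CR to LF (U+000A)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def join_lines(lines, newline="\n"):
    """Serializer side: the caller picks the line ending for its target."""
    if newline not in ("\n", "\r\n", "\r"):
        raise ValueError("unsupported line ending")
    return newline.join(lines)

lines = ["@foo bar", " qux"]
for_http = join_lines(lines, "\r\n")
for_linux = join_lines(lines, "\n")
print(len(for_http.encode("utf-8")), len(for_linux.encode("utf-8")))
```

Each line break costs one extra byte with CR+LF, which is why the size comparison treats CR+LF as the worst case.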
Nuit also doesn't specify endianness, or whether a list should map to an array or a vector, or how big a byte should be, or whether the computer system is digital/analog/quantum, or or or...
Nuit shouldn't even be worrying about such things. Nuit shouldn't have to specify every tiny minuscule detail of how to accomplish things.
They are implementation details, which should be handled by the parsers/serializers on a case-by-case basis, in the way that seems best to them.
"Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?"
Well, not exactly. If all the lines are blank, that's fine too. But assuming at least one non-blank line... there must be at least one line that is unindented. There might be more than one unindented line.
"...then I'd say a single item should be put in a list too [...] And yet now Nuit values at the root can't be strings; they must be lists."
That's correct. There's an implied list wrapping the entire Nuit text. You can think of it like XML's root node, except in Nuit it's implicit rather than explicit.
Calling "readfile" in Arc also returns a list of S-expressions, so this isn't without precedent.