Arc Forum | "A finite sequence of Unicode characters."This is a bit of a spec wormhole (as a...

Arc Forum

new | comments | leaders | submit

1 point by rocketnia 5106 days ago | link | parent

"A finite sequence of Unicode characters."

This is a bit of a spec wormhole (as akkartik calls it ^_^ ), but go with it if you feel it's right.

If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?

Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode? If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode.

Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)

---

"The only requirement is that it can hold 0 or more strings in order."

Technically it needs to hold sub-lists too, but I know that's not your point.

Zero? How do you encode a zero-length list in Nuit?

Is there a way to encode a list whose first element is a list?

Oh, come to think of it, is there a way to encode a list whose first element is a string with whitespace inside?

---

"In fact, it's actually illegal for the first sigil to be indented."

Cool. Put it in the doc. ^_^

I assume you mean the first line, rather than the first sigil. The first line could be a sigil-free string, right?

Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?

---

"I'm not against a Nuit parser/serializer using those kinds of de facto encoding schemes, but I want to keep Nuit simple, so I don't plan to put them into the spec."

I like it that way too.

---

  @attr href http://www.arclanguage.org/

Er, I think that creates the following:

  [ "attr", "href http://www.arclanguage.org/" ]

---

"But I believe making whitespace at the end of the line illegal is an overall net gain."

I agree. I'm not sure I'd make it illegal, but I'd at least ignore it.

If you make whitespace at the end of blank lines illegal, bah! I like to indent my blank lines. :-p

This is partially because I've used editors which do it for me, but also because I code with whitespace visible, and a completely blank line looks like a hard boundary between completely separate blocks of code.

1 point by Pauan 5106 days ago | link

"This is a bit of a spec wormhole (as akkartik calls it ^_^ )"

I have no clue what you're talking about.

---

"If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?"

I don't see why not...

---

"Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode?"

Invalid Unicode is not allowed.

---

"If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode."

I have no clue where you got that idea from... I'm assuming you mean that JSON is encoded in UTF-16.

UTF-16 is just a particular encoding of Unicode that happens to use two or four 8-bit bytes, that's all. UTF-16 can currently handle all valid Unicode and doesn't allow for invalid Unicode.

But JSON doesn't even use UTF-16. Just like Nuit, JSON uses "sequences of Unicode characters" for its strings. And also like Nuit, JSON doesn't specify the encoding: neither "json.org" or Wikipedia make any mention of encoding. And the JSON RFC (https://tools.ietf.org/html/rfc4627) says:

  JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

YAML should be fine too (http://www.yaml.org/spec/1.2/spec.html#id2771184):

  All characters mentioned in this specification are Unicode code points. Each
  such code point is written as one or more bytes depending on the character
  encoding used. Note that in UTF-16, characters above #xFFFF are written as
  four bytes, using a surrogate pair.

  The character encoding is a presentation detail and must not be used to
  convey content information.

  On input, a YAML processor must support the UTF-8 and UTF-16 character
  encodings. For JSON compatibility, the UTF-32 encodings must also be
  supported.

ECMAScript 5 (http://www.ecma-international.org/publications/files/ECMA-ST...) does specify UTF-16:

  A conforming implementation of this Standard shall interpret characters in
  conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
  10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
  implementation level 3.

And I believe Java also uses UTF-16.

But I see no reason to limit Nuit to only certain encodings. And if I did decide to specify a One True Encoding To Rule Them All, I'd specify UTF-8 because it's the overall best Unicode encoding that we have right now.

Instead, if a Nuit parser/serializer is used on a string that it can't decode/encode, it just throws an error. It's very highly recommended to support at least UTF-8, but any Unicode encoding will do.

---

"Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)"

The current spec for Nuit says to throw an error for byte order marks appearing in the middle of the file.

---

"Zero? How do you encode a zero-length list in Nuit?"

Easy, just use a plain @ with nothing after it:

  @
  @foo

The above is equivalent to the JSON [] and ["foo"]

---

"Is there a way to encode a list whose first element is a list?"

Yes:

  @
    @bar qux

And I was thinking about changing the spec so that everything after the first string is treated as a sigil rather than as a string. Then you could say this:

  @foo @bar qux -> ["foo" ["bar", "qux"]]
  @ @bar qux    -> [["bar", "qux"]]

---

"Oh, come to think of it, is there a way to encode a list whose first element is a string with whitespace inside?"

Yes, there's two ways:

  @ foo bar qux
  @
    foo bar qux

The above is equivalent to the JSON list ["foo bar qux"]. This follows naturally if you assume that empty strings aren't included in the list.

---

"I assume you mean the first line, rather than the first sigil. The first line could be a sigil-free string, right?"

Right, so I suppose it would be "the first non-empty line must have no spaces at the front of it".

---

"Er, I think that creates the following:"

This is true. You could use this, though:

  @attr href
    http://www.arclanguage.org/

Not as nifty, but it works.

---

"If you make whitespace at the end of blank lines illegal, bah! I like to indent my blank lines. :-p"

Nuuu. Well, maybe. I'm undecided on that whole issue.

-----

2 points by rocketnia 5106 days ago | link

"the JSON RFC (https://tools.ietf.org/html/rfc4627) "

Ah, I get to learn a few new things about JSON! JSON strings are limited to valid Unicode characters, and "A JSON text is a serialized object or array," not a number, a boolean, or null. All this time I thought these were just common misconceptions! XD

It turns out my own misconceptions about JSON are based on ECMAScript 5.

To start, ECMAScript 5 is very specific about the fact that ECMAScript strings are arbitrary sequences of unsigned 16-bit values.

  4.3.16
  String value
  primitive value that is a finite ordered sequence of zero or more
  16-bit unsigned integer
  
  NOTE A String value is a member of the String type. Each integer
  value in the sequence usually represents a single 16-bit unit of
  UTF-16 text. However, ECMAScript does not place any restrictions or
  requirements on the values except that they must be 16-bit unsigned
  integers.

ECMAScript 5's specification of JSON.parse and JSON.stringify explicitly calls out the JSON spec, but then it relaxes the constraint that the top level of the value must be an object or array, and it subtly (maybe too subtly) relaxes the constraint that the strings must contain valid Unicode: It says "The JSON interchange format used in this specification is exactly that described by RFC 4627 with two exceptions," and one of those exceptions is that conforming implentations of ECMAScript 5 aren't allowed to implement their own extensions to the JSON format, and must instead use exactly the format defined by ECMAScript 5. As it happens, the formal JSON grammar defined by ECMAScript 5 supports invalid Unicode.

---

"This follows naturally if you assume that empty strings aren't included in the list."

I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different.

-----

1 point by Pauan 5106 days ago | link

"I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different."

Then I'll change the spec to be more understandable. What wording would you prefer?

-----

2 points by rocketnia 5105 days ago | link

I'll try to make it a minimal change: "If there's anything between the @ and the first whitespace character, that intervening string is the first element of the list."

-----

1 point by Pauan 5105 days ago | link

That sounds great! I'll put in something like that, and also explain about the implicit list.

-----

1 point by Pauan 5106 days ago | link

"This follows naturally if you assume that empty strings aren't included in the list."

Before rocketnia jumps in and says something... if you do want to include an empty string, you must use ` or " which always return a string.

-----

1 point by akkartik 5106 days ago | link

> > This is a bit of a spec wormhole (as akkartik calls it ^_^ )

> I have no clue what you're talking about.

Me neither :)

Oh, I think it's http://article.gmane.org/gmane.lisp.readable-lisp/302

-----

1 point by Pauan 5106 days ago | link

Well, it's true that the Nuit spec intentionally ignores encoding issues, and thus a Nuit parser/serializer might need to understand encoding in addition to the Nuit spec. I don't see a problem with that.

The Arc implementation of Nuit basically just ignores encoding issues because Racket already takes care of all that. So any encoding information in the Nuit spec would have just been a nuisance.

There's already plenty of information out there about different Unicode encodings, so people can just use that if they don't have the luxury of relying on a system like Racket.

---

I see encoding as having to do with the storage and transportation of text, which is certainly important, but it's beyond the scope of Nuit.

Perhaps a Nuit serializer wants to use ASCII because the system it's communicating with doesn't support Unicode. It could then use Punycode encoding.

Or perhaps the Nuit document contains lots of Asian symbols (Japanese, Chinese, etc.) and so the serializer wants to use an encoding that is better (faster or smaller) for those languages.

Or perhaps it's transmitting over HTTP in which case it must use CR+LF line endings and will probably want to use UTF-8.

---

I'll note that Nuit also doesn't specify much about line endings. It says that the parser must convert line endings to U+000A but it doesn't say what to do when serializing.

If serializing to a file on Windows, the serializer probably wants to use CR+LF. If on Linux it would want to use LF. If transmitting over HTTP it must use CR+LF, etc.

Nuit also doesn't specify endianness, or whether a list should map to an array or a vector, or how big a byte should be, or whether the computer system is digital/analog/quantum, or or or...

Nuit shouldn't even be worrying about such things. Nuit shouldn't have to specify every tiny miniscule detail of how to accomplish things.

They are implementation details, which should be handled by the parsers/serializers on a case-by-case basis, in the way that seems best to them.

-----

2 points by rocketnia 5106 days ago | link

"Oh, I think it's http://article.gmane.org/gmane.lisp.readable-lisp/302 "

Yep! Sorry, I forgot that wasn't on Arc Forum.

See, I follow your links. ^_^

-----

1 point by Pauan 5106 days ago | link

Oh, I missed a question.

"Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?"

Well, not exactly. If all the lines are blank, that's fine too. But assuming at least one non-blank line... there must be at least one line that is unindented. There might be more than one unindented line.

-----

1 point by rocketnia 5106 days ago | link

What would more than one unindented line do?

If they're put in a list...

  foo
  bar
  baz
  -->
  ["foo","bar","baz"]

...then I'd say a single item should be put in a list too:

  foo
  -->
  ["foo"]
  
  @foo
  -->
  [["foo"]]

And yet now Nuit values at the root can't be strings; they must be lists.

-----

1 point by Pauan 5106 days ago | link

"...then I'd say a single item should be put in a list too [...] And yet now Nuit values at the root can't be strings; they must be lists."

That's correct. There's an implied list wrapping the entire Nuit text. You can think of it like XML's root node, except in Nuit it's implicit rather than explicit.

Calling "readfile" in Arc also returns a list of S-expressions, so this isn't without precedent.

-----

1 point by rocketnia 5105 days ago | link

Okay, fair enough. I see that was implied by the documentation's examples all along too. ^_^

-----