Arc Forum
3 points by akkartik 4622 days ago | link | parent

Thanks for the bug report. It is certainly good to know about, but I don't think the HN folks come by here much, and any fixes we come up with might well never make it back. You might want to inform them separately at info@ycombinator.com.

It looks like whitec[1], and therefore urlend[2], is not Unicode-aware. I suspect that's true of all of Arc's string functions. I'll try to investigate, but I'm pretty ignorant of Unicode; any help fixing this would be most appreciated. (Edit 25 minutes later: it turns out this was super easy to fix using Racket's underlying Unicode support -- https://github.com/nex3/arc/commit/052b560a2b)

[1] http://files.arcfn.com/doc/string.html#whitec; https://github.com/nex3/arc/blob/ca2b70213a/arc.arc#L1130

[2] https://github.com/nex3/arc/blob/ca2b70213a/lib/app.arc#L513



3 points by Pauan 4622 days ago | link

For reference, here are the whitespace characters as defined by Unicode:

http://en.wikipedia.org/wiki/Whitespace_character#Unicode

And here's some Arc stuff to match them:

  (mac unicode-range args
    `(list ,@(mappend (fn (x)
                        (if (acons x)
                            (accum a
                              (for i (coerce car.x 'int) (coerce cadr.x 'int)
                                (a:coerce i 'char)))
                            (list x)))
                      args)))

  (= unicode-whitespace
     (unicode-range (#\u0009 #\u000D) #\u0020 #\u0085 #\u00A0 #\u1680 #\u180E
                    (#\u2000 #\u200A) (#\u2028 #\u2029) #\u202F #\u205F #\u3000))

  (def whitec (c)
    (some c unicode-whitespace))

Also, Racket has a "char-whitespace?" function which, after very minor testing, seems to do the same thing:

http://docs.racket-lang.org/reference/characters.html?q=whit...
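
As a rough spot-check of the table-based whitec above (a sketch only, assuming Arc 3.1, where "for" includes both endpoints, and that the definitions above are already loaded):

  ; each range contributes both of its endpoints, so this should be
  ; 5 + 5 + 11 + 2 + 3 = 26 entries
  (len unicode-whitespace)

  (whitec #\space)    ; ordinary ASCII space
  (whitec #\u00A0)    ; no-break space, which an ASCII-only check misses
  (whitec #\u3000)    ; ideographic space
  (whitec #\a)        ; a letter, so this should come back nil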

-----

3 points by akkartik 4622 days ago | link

Thanks! I just updated Anarki to use char-whitespace? and friends.

-----

3 points by Pauan 4622 days ago | link

I wonder if it'll cause any problems... in particular, Arc's "punc" is obviously not trying to be comprehensive, so maybe it was designed specifically for URL syntax?

-----

1 point by akkartik 4622 days ago | link

Hmm, it was always defined in arc.arc, so I think it's intended for more than URLs. Besides, why would bang be part of URL syntax? And wouldn't it also need ampersand? The original version seemed to include the characters for regular English punctuation.

Update an hour later: but I see where you're coming from; punc is only used in one function -- urlend.
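
For comparison, an ASCII-only punc along the lines described above (bang included, ampersand left out) might look like the sketch below. The exact character list here is an assumption, not a quote of arc.arc:

  ; hypothetical reconstruction; check arc.arc for the real list
  (def punc (c)
    (in c #\. #\, #\; #\: #\! #\?))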

Update two hours later: I found one regression in urlend: https://github.com/nex3/arc/commit/671c5ec916. We also have unit tests for markdown now, so any further regressions need only be caught once more.

-----

2 points by Pauan 4621 days ago | link

I think you're right, but for future reference:

http://en.wikipedia.org/wiki/Percent_encoding#Types_of_URI_c...

http://en.wikipedia.org/wiki/URI_scheme#Examples

-----