Arc Forum

7 points by mr-anonymous 6702 days ago | link | parent

As I understand it, Arc doesn't yet handle Unicode or anything other than ASCII. Therefore if I say "Espana" (that's n-with-tilde), or IPA like "ɬɪŋkɪt" (Tlingit), or Chinese like 中國 (China), what's the output page going to say?

6 points by pg 6701 days ago | link

Why don't you try it and see?

-----

12 points by mr-anonymous 6701 days ago | link

Step 1: figure out where/what "MzScheme" is. Google is my friend, but a link to the site wouldn't hurt.

Step 2: figure out how to get the right version. Not hard, but a step.

Step 3: get the right architecture. Also not hard.

Step 4: run $RIGHT_PATH/bin/mzscheme -m -f as.scm

Step 5: figure out how to run "webapp.arc".

Step 5a: First attempt: "mzscheme -m -f as.scm webapp.arc" doesn't work.

Step 5b: Read documentation (what, I'm supposed to read it first?) and see pointer to "blog.arc"

Step 5c: Read header at top of blog.arc which says to run '(load "blog.arc")' followed by '(bsv)'. Okay, I can do '(load "webapp.arc")'

Step 5d: Figure out that '(bsv)' is the function name to start the server, and is specific to that blog code. I need to '(asv)' instead. w00t!

Step 5e: Go to localhost:8080 and find "it's alive". Figured out that I need to go to "localhost:8080/said" to get the web interface.

Step 6: Go to the newly started server. Input my home town (contains a diacritic). Oops! The diacritic disappeared. The English spelling of the city's name is not the same as the real name minus the diacritic! Try it out yourself with "Espana" - including the tilde over the n (which you won't see here because this server stripped it away). The English name for Spain is not "espana".

(Step 7: Mutter when repeated ^C presses don't kill the program; did a ^Z and then kill %% rather than the (tl) (quit) needed to exit more gracefully.)

I tried various other special characters: the symbol for British pounds (GBP) gets turned into "GBP", the Japanese yen symbol (JPY) gets turned into "JPY". A grep finds this conversion done in "latin1-hack", which is indeed a hack.

Yet upper case sigma (∑) comes back without a problem, as does the traditional Chinese for China (中國). These are encoded through '&' escapes. So why do the Latin-1 hack at all?

Hmm, and the server doesn't specify a charset ... and it doesn't escape embedded text, so if I write "<b>this is not bold</b>" the HTML tags get interpreted.

In summary, the specification says that the final page displays "whatever [was] put in the input field". Yet the given solution does not display "A <GBP>" (that's "A-with-a-circle less-than-sign British-pound-sign greater-than-sign") correctly. The output is "A<GBP>" and the unknown HTML tag is not displayed, so I only see "A".
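A minimal sketch of the escaping step that would fix this, written in Python rather than Arc. The function name `render_said` is hypothetical, and the test string is my reading of the spelled-out "A-with-a-circle less-than-sign British-pound-sign greater-than-sign". It relies on the standard library's `html.escape`, so the angle brackets reach the browser as text instead of as an unknown tag:

```python
import html

def render_said(user_input: str) -> str:
    """Return an HTML fragment that displays user_input literally."""
    # html.escape replaces &, < and > (and quotes, by default) with entities,
    # so the browser shows the characters instead of parsing them as markup.
    return "<p>you said: " + html.escape(user_input) + "</p>"

print(render_said("Å <£>"))
# prints: <p>you said: Å &lt;£&gt;</p>
```

The non-ASCII characters pass through untouched here; whether they display correctly then depends on the page declaring its encoding.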

P.S. This server's session timed out before I finished typing in all of the above so I had to start a new comment and copy&paste from the old. Somewhat annoying.

-----

3 points by mr-anonymous 6697 days ago | link

I take it that none of the arc people are worried that the arc solution to the challenge doesn't work? I can't write the symbol for the British pound or other high Latin-1 characters, and it doesn't escape correctly for display in HTML.

So far I've only seen a couple of people mention the lack of proper Unicode support and the huge XSS hole, and these were people who implemented the complete problem using some other language.

When will there be an arc program which implements the arc challenge?

-----

1 point by jmatt 6697 days ago | link

PG has already addressed character sets. arc only supports ascii.

For further information see:

http://arclanguage.org/item?id=391

http://paulgraham.com/arc0.html

JMatt

-----

1 point by mr-anonymous 6697 days ago | link

Yes, I read those. But the point of the challenge is that the last page displays "whatever he put in the input field". I tried out the supposed arc answer to the challenge and it doesn't actually display what I put into the input field.

Try writing "The first conquistador in what is now the US was Juan Ponce de Leon and the last was Don Juan de Onate Salazar." (There's an o-with-acute-accent in Leon, and there's an n-with-tilde in Onate.)

Try writing "Noroveirusyking a HliX", which is a headline from today's MorgunblaXiX (a newspaper in Iceland).

Try writing "Don't use the <blink> element!"

Or try writing some of the other problems I pointed out earlier. (A parent to this comment.)

  * They do not work. *

If the challenge was "... as long as the input is in ASCII and doesn't include the '<' and '>' and '&' characters" then that's different. But that's not the challenge.

At the very least, raise an exception for out-of-range characters. The current code hacks some Latin-1 characters to ASCII, others to "X", and encodes characters >= 256 to &# escape codes. This is wrong.
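To make the two consistent alternatives concrete, here is a hedged Python sketch (not the Arc code, and the function names `require_ascii` and `to_char_refs` are made up for illustration). One function rejects out-of-range characters outright; the other treats every non-ASCII character the same way by turning it into a numeric character reference, rather than mixing three behaviours:

```python
def require_ascii(text: str) -> str:
    """Strict option: reject anything outside ASCII instead of guessing."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        bad = text[err.start]
        raise ValueError(
            f"unsupported character {bad!r} at index {err.start}"
        ) from err
    return text

def to_char_refs(text: str) -> str:
    """Lossless option: ASCII passes through, the rest becomes &#N; references."""
    # The "xmlcharrefreplace" error handler emits decimal references.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_char_refs("España costs £5"))
# prints: Espa&#241;a costs &#163;5
```

Note that `to_char_refs` does not touch '<', '>' or '&'; escaping those is a separate step.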

To which kens added that because the server doesn't set the content-type encoding, if the browser autodetects the ASCII as being utf-7 then there's another possible attack.
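A small Python sketch of what declaring the encoding could look like, assuming the server builds its response by hand (the helper name `html_response` is hypothetical and this is not how the Arc server is written). With an explicit `charset` in the `Content-Type` header the browser has no reason to autodetect:

```python
def html_response(body: str) -> bytes:
    """Build a bare HTTP/1.1 response that states its encoding explicitly."""
    payload = body.encode("utf-8")
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/html; charset=utf-8",  # no guessing left to the browser
        f"Content-Length: {len(payload)}",         # length in bytes, not characters
        "Connection: close",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload

response = html_response("<p>España</p>")
print(response.split(b"\r\n\r\n")[0].decode("ascii"))
```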

-----

1 point by mr-anonymous 6696 days ago | link

Now I should be able to speak properly. http://news.ycombinator.com/item?id=111100

XSi! Antligen! Tschuss!

-----

5 points by mr-anonymous 6702 days ago | link

And as you can see, this web server doesn't like non-Latin names either. I wonder if the MorgunblaXiX mentioned Paul Erdős' trip from San Jose to Koln. I heard he talked about Mobius strips while eating phở.

(Just had to try it out.)

-----

5 points by mr-anonymous 6702 days ago | link

Neat-o! Three different modes. Iceland's newspaper got a pair of "X"s in the name, some Latin names with diacritics got the diacritics removed, others (the double acute in Paul Erdos) got &escaped.

-----

3 points by staticshock 6701 days ago | link

would unicode support in arc increase the length of pg's exercise program?

-----

2 points by mr-anonymous 6701 days ago | link

There are two problems with the code. One is the strange things it does to some characters (stripping diacritics, converting some graphemes to two separate letters by assuming they are ligatures, and so on). Fixing this would not change the code.

The other is that it doesn't escape '<' and '>' correctly so embedded HTML-like text gets improperly interpreted as HTML. One of the advantages of some of the templating systems is the default mode is to escape everything, making it harder to do XSS and other attacks. Fixing that might make the code longer, or not, depending on the solution.
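As a rough illustration of the "escape everything by default" idea, here is a tiny Python sketch. The `Raw` marker class and `render` helper are invented for this example and are not taken from any particular templating system. Every value is escaped unless the author explicitly opts out:

```python
import html

class Raw(str):
    """Marker for markup the author has deliberately chosen to trust."""

def render(template: str, **values: object) -> str:
    """Fill str.format-style fields, escaping every value unless it is Raw."""
    safe = {
        name: value if isinstance(value, Raw) else html.escape(str(value))
        for name, value in values.items()
    }
    return template.format(**safe)

print(render("<p>you said: {said}</p>", said="<b>bold</b>"))
# prints: <p>you said: &lt;b&gt;bold&lt;/b&gt;</p>
```

The calling code is no longer than the unescaped version; the extra length lives in the helper, which is written once.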

(Just checking if I can make this <b>bold</b>. If so .. hmm.)

-----