I was surprised to find the day after releasing Arc that the single
biggest criticism of the language was that it didn't support Unicode.
Then I thought about it, and I wasn't so surprised. The number one
criticism of anything complicated is always going to be about the
color of the bicycle shed.
For all the people who are calling me an evil xenophobe, let me
clarify. I didn't mean Arc will only ever support Ascii. All I
meant was that I haven't bothered with characters yet. When I get
around to it, it will support whatever n-byte character set seems
to be the default by then. In the meantime, you have the source
code. If you want a different character set, stop whining and start
hacking. If you do it well I'll incorporate it in the next release.
Judging from the comments I've seen, most people will now be placated.
But there seem to be some who think it irresponsible even to
have released something that only supported Ascii. How could I do
such a thing?
I've ignored character sets so far for the same reason McCarthy
ignored them in his 1960 Lisp paper: he was focusing on something
else. Character sets are a peripheral matter. The only reason
they loom so large in the average programmer's life is that, though trivial, they're an enormous time suck.
Trivial + time consuming. Sounds like a good thing to postpone.
I can't make more time. Spending time on character sets would mean
not spending it on something else. So let me explain what I did
spend time on: making programs short.
I deliberately phrase this in a low-key way. In fact, it is the
main purpose of high-level languages. So far as that's true, doing
things to make programs shorter is (except for pathological cases)
identical with making a language good.
So what I want to focus on in the immediate future is making a
language that does a good job at what languages are for. If someone
can show me examples of things that are short in other languages
but require gross circumlocutions in Arc, that's the kind of problem
I really want to fix.
There are already plenty of programming languages that are horrible
kludges at the core, but compatible with everything. Is it so
terrible to have one where we try to do things in the other order--
to make the core clean and powerful before putting a nice coat of paint on the exterior?
This probably isn't the sort of thing that pg is talking about, but what I would like is a large and modern standard library.
I do most of my exploratory programming in Python these days, not because I like it better than CL, but because I don't want to spend 6 hours tracking down a library that does X that is compatible with CL implementation Y I happen to be using. It seems like the only truly universal extension to CL since the ANSI spec was written is gray streams. The best case scenario is I find an asdf package, and it happens to work with the implementation I am using.
However, most of the time I'll install it on CLISP and it won't work, then try it on SBCL and it does (or vice-versa).
With Python, I just load up the module reference and spend 5 minutes finding the module that ships as part of the distribution that does what I need.
CL doesn't even have a standard way to connect to a UDP socket. That's a minor thing in and of itself, but I've never tried to prototype any significant program and not run into something like this.
I think the creators of Arc get this, since one of the things included is a webserver. Also, having a central clearinghouse for the language and having the implementation be the documentation both lend themselves well to allowing a large standard library to grow, so I am very hopeful.
After all, Python did not have any sort of impressive library in 1991 when Guido first publicly posted it (it did have a module system though hint-hint).
And don't forget good documentation, preferably built-in. Whenever I am exploring something I haven't done before, it's a great boon not having to read the sources, or even look it up on the web. Python's help() is wonderful.
I, for one, endorse your approach of getting the foundations right before building towers atop.
I venture in explanation of the character set controversy: IIRC you have written of wanting Arc to be good specifically for web programming. (And obviously you are using it thus.) Ascii is fine when working on code to do symbolic differentiation (as I believe McCarthy was interested in at the dawn of time). But for a web app for the Chinese market, say, Unicode is obviously going to be involved. I think perhaps people were just expecting 'new web-app language' to entail Unicode support.
"I think perhaps people were just expecting 'new web-app language' to entail Unicode support." -- I would go further: I would say that many of us consider Unicode support an essential; it's part of getting the foundations right. PG mentioned hearing/reading Guido talk about the pain of switching Python's character support--but what was painful was the switching, not the character sets. The sooner Arc makes that switch, the less pain it'll be.
Releasing a language which only supports ascii is a travesty today. If what you really meant was that characters would be opaque streams of bytes, you should have said that - the fact that you didn't indicates that you don't really understand the issues involved. If that is why you didn't write any unicode support, you should have said that, too.
But you didn't just say you didn't have any support for it, or that you didn't understand the issues and needed to figure out what best to do. You were dismissive in the extreme of the entire idea - "I don't want to spend even a day on character sets" - and for someone who pretends to be working on a new language for web development this is such a monumental lack of judgment and information that it taints the entire language. The issue with HTML tables is similar, but to a lesser degree.
If this were some college kid's half-baked homework Lisp implementation nobody would give a damn. But you've been preaching on Lisp forever, and you've been talking about Arc for years, and you've talked a lot about the next 100 year language. This is what we're supposed to look at for our revolution in computing? Even Visual Basic has high quality unicode support.
Making your HTML libraries dump stuff as tables is just silly, too. It's not egregious, like saying that unicode doesn't matter, but come on now. It's not 1978 anymore. There's an expectation that you have spectacularly failed to live up to.
Also, the fact that the forum tells you to create an account after you've written your comment and submitted it? Extremely crappy.
Sorry to break it to you, but you have totally missed the point. Arc is not finished, it is still an experimental language. There will be unicode support, but there are many core issues yet to be resolved. You shouldn't use Arc for actual projects right now anyway, as Paul already warned that future changes not only can, but will break old code. So this is not the time to worry about Unicode. It's a bit like building a house and worrying about the curtains while you're still stacking bricks...
You're conflating two things: the kind of language that is good for creating "quick and dirty" hacks, and the kind of language that one can produce through "quick and dirty" hacks. These two are not the same. Python, for example, is the Q&D hacking language of choice for many people today, but it is not itself quick or dirty at all. That's why people use it. They love the deep library and gestalt of consistent programming UI. Those things sound simple, but as it turns out simple is harder than complex. It's easy to make something complex. Look at PHP. It's hard to make something simple. Look at Apple's products.
The goal for Arc needs to be clarified. Is it, "There oughta be a language for PG to write dirty hacks in?" or "There oughta be a language for PG to write using dirty hacks?" The Unicode decision seems like a clear indication that the latter is the case, and people who were expecting the former are unhappy about it.
Speaking of hash tables, I remember in ACL you explained why CL has two return values for gethash, to differentiate between the nil meaning "X is stored as nil in my table" and the nil meaning "X is not stored in my table". So why not in Arc?
Dear Paul, I can not believe you would make such a statement. Either you're living in a vastly different programming universe than the one I am living in, or you really haven't done that much programming at all. In any case, there are many situations where one stores types of values that may include nil in a hash table, and in most of these there is a very significant difference between 'value is nil' and 'value is not stored'. I understand that Arc isn't trying to be all 'enterprisey', but these are fundamental concepts that, I thought, only complete amateurs did not understand. Sincerely, Dr. Drake
You know, I do actually understand the difference between the two cases. What I'm saying is that in my experience hash tables that actually need to contain nil as a value are many times less common than those that don't.
In situations where the values you're storing might be nil, you just enclose all the values in lists.
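The list-wrapping trick can be sketched in Python, with `None` standing in for Arc's nil and a plain dict standing in for an Arc table. The helper names `store` and `lookup` are hypothetical, made up for this illustration; they are not Arc functions.

```python
# Hypothetical helpers illustrating the "enclose values in lists" trick.
# None plays the role of nil; a dict plays the role of a hash table.

def store(table, key, value):
    """Store the value wrapped in a one-element list."""
    table[key] = [value]

def lookup(table, key):
    """Return the wrapping list if the key is present, else None."""
    return table.get(key)

table = {}
store(table, "a", None)     # the stored value itself is None

hit = lookup(table, "a")    # a list: the key is present, its value is hit[0]
miss = lookup(table, "b")   # None: the key was never stored
```

Because a one-element list is always truthy, a single lookup tells "stored as nil" apart from "not stored", at the cost of unwrapping on every read.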
My goal in Arc is to have elegant solutions for the cases that actually happen, at the expense of elegance in solutions for rare edge cases.
table is a hash table, k is a key, gethash returns the value in table for k or 0 if no value for k is found. Think of incf as ++. It will increment the value by 1 and set that as the value for k in table.
Oh, it definitely throws strong typing right out of the window.
The reason I suggested it is because it would seem that almost all of the time where you go to do an increment on a nil value, you're working with an uninitialized element (not necessarily in a hash map) and treating that as 0 (as you're doing an increment) would in a certain sense be reasonable behaviour.
But I guess you're right, in the case where nil does represent an error, it'll be two steps backwards when you go to debug the thing.
Why would you extend rather than subclass the Array class? It kind of confirms all of my worst fears about Ruby's too-easy class reopening. (What happens when someone else defines an Array method called "categorize" for a totally unrelated purpose?)
I think that the Python syntax for this is
h[x] = h.get(x, 0) + 1
It isn't quite as concise as the Common Lisp but more so than Arc. I'd be curious to see what the Common Lisp looks like if you are doing something more complicated than an in-place increment. E.g. the equivalent of:
A hashtable containing integer values is a common implementation for the collection data structure known as a Bag or Counted Set. The value indicates how many instances of the key appear in the collection. Incrementing the value would be equivalent to adding a member instance. Giving a zero default is a shortcut to avoid having to check for membership.
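A minimal sketch of such a bag, built on the same `get(x, 0)` idiom quoted earlier in the thread. The class name `Bag` and its methods are illustrative assumptions; Python's standard library already ships `collections.Counter` for this job.

```python
class Bag:
    """A counted set backed by a dict mapping item -> number of occurrences."""

    def __init__(self):
        self._counts = {}

    def add(self, item):
        # The zero default means we never have to test for membership first.
        self._counts[item] = self._counts.get(item, 0) + 1

    def count(self, item):
        # Items never added report a count of zero instead of raising.
        return self._counts.get(item, 0)


bag = Bag()
for word in ["a", "b", "a"]:
    bag.add(word)
```

Here `bag.count("a")` reflects the two additions, and an item that was never added simply counts as zero.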
I think the phrasing of the announcement may be a large part of the problem. Not doing unicode is about the only concrete thing it says about Arc. People read it expecting to find out why Arc is going to be great. And it didn't really say, except for talking about the principles of conciseness and power. Even the tutorial doesn't say "To allow the writing of concise, powerful programs, Arc introduces features X, Y and Z." It just says "Arc (and lots of other Lisps) have features A, B, C, D, E, F..."
People don't grasp abstract principles well, and even when they do, they don't trust you to mean what they think you mean until they see some concrete evidence. Your problem is that the "unicode admission" and the stuff about the HTML library are the only solid statements in the announcement about what they can expect to find in Arc. That's what they latch on to instead of a vague promise that Arc will let them write shorter programs.
I find it utterly hilarious that people @ news.yc complain to no end that the site has non-"hacker" stuff. Then when an interesting open source project is released they complain that it doesn't have such and such a feature that they need. So are they hackers or not? What gives?
While I agree with your sentiment, it is possible that the subset of people who complain about 'non hacker' stuff on news.yc doesn't intersect with the subset of people who complain about some feature that Arc lacks....
I think this is partly just a communication problem. Dale Carnegie would advise taking a different tone. Instead of saying "politically correct", something more apologetic might be better: "I didn't have time yet; of course a better character set will be supported in the future."
Yes, I think Arc intentionally supports only ASCII just to not bother with Unicode issues as of right now.
Anyway, I can't see how Unicode can break in Arc. I'm not a Lisper, but I think you can't extract 1 byte from an Arc string (since it's just a MzScheme string), but 1 char instead. That's a different concept, because in Unicode 1 char can be formed with 1, 2 or more bytes.
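The char-versus-byte distinction is easy to see in Python 3, used here only as a stand-in since its `str` is also a sequence of code points. This is an illustration of the concept, not a claim about how Arc or MzScheme store strings internally.

```python
text = "na\u00efve"             # "naïve", with ï as the single code point U+00EF
encoded = text.encode("utf-8")  # the same text as UTF-8 bytes

char_count = len(text)          # counts code points
byte_count = len(encoded)       # counts bytes; ï takes two of them in UTF-8

third_char = text[2]            # indexing a string yields a whole character
third_byte = encoded[2]         # indexing bytes yields one byte of that character
```

Indexing the string can never split ï in half; indexing the byte sequence can.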
Complaints take on a life of their own. If it was not Unicode it would be something else. But we do have to complain...don't we? Mine is the absence of dynamic scope (hoping I missed it). Meanwhile, omigod, your prime directive is brevity?! I worked on the same floor with the K guys at UBS, having flashbacks. That and speed was all they could talk about. :)
GvR took a year because he didn't want to break old code. PG appears to have no compunctions about doing such a thing. And not only does that make perfect sense, but he also warned us. The moral of the story is, don't write a million-line application in Arc just yet.
Then to me this makes the whole thing a non-starter, unfortunately, because no one will want to write any non-trivial program in a language that could (will, by the creator's declaration!) change in incompatible ways in the future.
One example: generic collections in Java 5. Sun went out of their way to ensure compatibility with pre-generic collections, giving us type-erasure. Bletcherousness for the sake of backwards compatibility.
Characters are such a fundamental part of a modern, general purpose, computer language that it seems short-sighted not to allow for dealing with the issue up front.
Honestly, though, it is early enough in the game that if people wanted to hash out the specification for Unicode support in Arc, it could be done. MzScheme characters are Unicode, aren't they? Build the definition on that foundation.
Non-starter for your next production app maybe, but not a non-starter to code enough Arc to see how it compares to your other favorite languages so you can submit suggestions for improvement. If this mode of operation makes it easier to change the language for the better, I'm all for it :)
Eventually, backward compatibility will be very important, but having that too early just kills momentum IMO.
I saw you implemented function composition as what seems like 'symbol hack'. I don't know what to think about it... But in the same vein, I thought it could be possible to do the same for gensyms in macros. Something like prefixing hygienic macro variables with @
I like this idea (and the character you chose: @), but as zunz points out in http://arclanguage.org/item?id=105 , it would be convenient in some cases to have all symbols resolve at the macro definition site, rather than at the use site, except for those specified otherwise (for instance 'it' in the aif definition). If I understand correctly, this is all that hygienic macros require.
Hygiene does have the drawback that you can't bugfix a function used in a macro definition and have it just work, and that might well be more important in a language which is so young. I only mention this because I just did this, and if it hadn't worked I'd have been surprised. But that could be because most of my lisp experience is CL, rather than Scheme.
Thank you very much for the clarification! People got riled up because it sounded like you didn't want Arc to support unicode ever (or that the current support would be removed). As long as the language is in flux it's not a problem.
However, fundamental Unicode support probably has to be in place before release 1.0. It will be painful to add at a later time if backwards compatibility is an issue. For example a lot of string processing code might assume that accessing characters by index is constant time. If the internal representation is changed to eg. UTF-8 this might lead to performance issues. On the other hand, if code assumes that strings are equivalent to byte-arrays, it might lead to trouble if they are changed to arrays of 32bit-values.
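To make the indexing concern concrete, here is a sketch of what "character at index n" costs when the internal representation is UTF-8: the bytes have to be scanned from the start, so the lookup is linear rather than constant time. The function name `utf8_index` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 data by scanning."""
    if n < 0:
        raise IndexError("negative index")
    count = -1      # index of the code point currently being scanned
    start = None    # byte offset where the n-th code point begins
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                # The next code point has begun, so the n-th one ends here.
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


# "aé€z": code points of 1, 2, 3 and 1 bytes respectively.
sample = "a\u00e9\u20acz".encode("utf-8")
```

With an array of fixed-width 32-bit values the same lookup is a single offset calculation, which is the trade-off the comment describes.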
I believe the simplest solution is to just have characters be 32bit integers. The internal representation of a string is just an array of 32bit characters. Sure this consumes more space, but who cares? As long as strings are a type separate from byte-arrays, encoding/decoding issues can be handled in libraries.
I think the point is that, in the presence of combining diacritics, even 32 bits isn't enough. A character is (roughly) one "base" 32-bit code plus zero or more "combining" 32-bit codes. And equality between two characters isn't purely structural - you might re-order its combining codes or use a pre-combined code. (Not all combinations have pre-combined codes.)
I will point out that I know very little about Unicode, so I might be a bit off. I can't say that I'm even very interested in the whole Unicode debate, so long as it all gets sorted out at some point in the future.
The only reason Unicode contains combined forms is for compatibility with existing standards: you cannot invent new code points representing a novel combination of base and combining characters. The Unicode normalization forms deal with these issues.
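A small illustration of the normalization point, using Python's standard `unicodedata` module: a precomposed code point and its base-plus-combining spelling compare unequal as raw code point sequences, and equal once both are put in the same normalization form. The variable names are just for this example.

```python
import unicodedata

precomposed = "\u00e9"    # é as one code point
combining = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# Structural comparison of code point sequences says these differ.
raw_equal = precomposed == combining

# After normalization they agree, in either direction.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

# Combining marks in a different order: dot below (U+0323) and acute (U+0301).
order_a = "a\u0323\u0301"
order_b = "a\u0301\u0323"
order_equal = (unicodedata.normalize("NFD", order_a)
               == unicodedata.normalize("NFD", order_b))
```

This is the kind of work that can live in a library, provided the core string type is a sequence of code points to begin with.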
Unicode support is a complex issue: fundamentally there are the issues of low-level character representation (e.g., internal representation) followed by library support to handle normalization and higher-level text processing operations.
True, I should have said unicode code points rather than characters. I believe the fundamental point is that strings should always be sequences of unicode code points, and shouldn't be conflated with byte arrays. The thorny issues of normalization, comparing, sorting, rendering combined characters and so on could be handled with libraries at a later stage.
If intable could be made to return the actual value in the table, I'd have no reservations, since you could
(aif (intable cache args)
     (thevalue it) ; where the fn thevalue does cadr
                   ; or whatever it needs to do
     (= (table args) (apply f args)))
Scanning over the entire table twice, once to see if the key is there, and once to actually get the value, is what seemed gross to me about checking for the key first. No doubt that could be optimized by having tables keep a list of keys separately, but that seemed like a heavyweight fix.
I agree that your version is prettier on the surface, though.
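For comparison, here is how a memoizing cache can get by with one lookup per call while still caching nil results. This is a Python sketch using a private sentinel object, which is a different technique from the `intable` idea above and is not how Arc does it; `memoize` and `_MISSING` are names invented for the example.

```python
_MISSING = object()   # a unique marker that no cached function can return


def memoize(f):
    """Cache results of f by argument tuple, including results that are None."""
    cache = {}

    def wrapper(*args):
        value = cache.get(args, _MISSING)   # a single lookup
        if value is _MISSING:               # truly absent, not merely None
            value = f(*args)
            cache[args] = value
        return value

    return wrapper


calls = []            # records each real invocation, for demonstration


@memoize
def half_if_even(x):
    calls.append(x)
    return x // 2 if x % 2 == 0 else None


first = half_if_even(3)    # computed: returns None and is cached as None
second = half_if_even(3)   # served from the cache, no second call
third = half_if_even(4)    # computed
```

The sentinel plays the same role as wrapping values in lists: it gives "not stored" a representation that can never collide with a stored value.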
It goes to show how much prose writing matters when launching a new language. If you hadn't written a paragraph about character sets in the announcement, nobody would have given you any grief about it. People will focus on what you focus on.
The current position seems like "well, you can write exploratory programs using only the numbers 0 to 7, so clearly they're the first priority".
String handling is an important feature of any programming language, and strings are series of characters. Doesn't that suggest character sets are pretty core?
I'd suggest the main reason character handling is currently such an enormous time suck for developers is that many of the languages in common use were developed before Unicode came along. If you want to design a language that will be an enormous time suck in use now, so be it.
Supporting short programs is great - but supporting short programs that will actually work is much better. A solution that is special-cased to a limited range of values is not a reusable solution.
Try to imagine a modern language without support for capital letters, or a language that only supported 4 of the normal vowels. Sure, you'd be able to get by, but it would be a major nuisance factor. That is how important basic non-ascii characters like accents and es-zettes are to western European languages. As you go further East things just get worse, until you get to languages like Chinese which bear no resemblance to ASCII.
So yes, unicode might seem like an unimportant feature to you, but it makes a huge difference to other cultures.
You are of course free to work on the features that interest you, just don't try and pretend that Unicode isn't an important feature just because you don't want to work on it.
Or just one type of string: unicode character strings (which are sequences of unicode code points). Then a separate type for byte arrays. Byte arrays are not character strings, but can easily be translated into a string (and back).
...and this seems to be exactly what MzScheme provides :-)
Strings in MzScheme are sequences of unicode code points. "bytes" is a separate type which is a sequence of bytes. There are functions to translate between the two, given an encoding.
Python 3000 is close to this, but I think Python muddles the issue by providing character-related operations like "capitalize" and so on on byte arrays. This is bound to lead to confusion. (The reason seems to be that the byte array is really the old 8-bit string type renamed. Will it never go away?) MzScheme does not have that issue.
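The two-type design described above, sketched in Python 3 since that is the nearest thing in the thread to a runnable example. It only illustrates the string/bytes split and the role of the encoding; it says nothing about MzScheme's actual functions.

```python
text = "stra\u00dfe"               # "straße": a string of code points

# Translating to bytes requires naming an encoding, and the result depends on it.
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

# Translating back with the matching encoding recovers the original string.
round_trip = utf8.decode("utf-8")

# The two types do not mix implicitly.
try:
    text + utf8
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The same text yields byte sequences of different lengths under the two encodings, which is why a byte array on its own cannot be treated as a character string.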
Most people seem to program applications and not algorithms; otherwise they would recognize the spirit and intention of Arc, which is to actually make programs shorter.
I suppose this is not a priority for other people, whereas building an application is, hence their need for Unicode. But I suppose this is covered by the fact, as you stated, Paul, that Arc is not for everyone.
It would be interesting to see performance charts and implementations of data structures to see any benefits.
As Arc develops, it might be worthwhile to implement subsets of it in other languages for comparison. Then some of the sample applications can be written across all these alternatives. At some point there might be a few examples or an implementation point where it's infeasible (i.e., very painful) to use anything but Arc itself.
That's so funny. People want to use your new language because you are a smart guy, but the same people can't accept your ideas and procedures. I think if people can't wait for you to finish the language, then they shouldn't use Arc (sorry for my English).