What can't we do with just macros in PLT scheme? Here's what I have so far:
functions return nil by default
= can do arbitrary places like common lisp
ssyntax: [], :, .
convenient indexing into lists, strings or tables
convenient dot notation for rest args
convenient destructured assignment in =, let, etc.
I myself have implemented large amounts of Arc in PLT Scheme, hitting some roadblocks, passing some of them and giving up on others. I've done the same thing, more recently and extensively, in Common Lisp. I can show you what I've done if you are interested.
Note that one could write an 'fn macro that, depending on what's in its parameter list (which is certainly available at macroexpansion time), might expand directly to a 'lambda form, or might expand to (lambda args (destructuring-bind <parameter list> args <body ...>)). And Scheme does have something called 'case-lambda, which looks like it can be used to make functions with optional parameters. I have implemented a pretty good 'fn in Common Lisp (which supports optional parameters):
(Note some things I've done: I wrote 'mac so it supports the . notation for rest parameters--which is what 'restify deals with, by the way; I wrote 'with so that you can either go (with (x 1 y 2) ...) or (with x 1 ...), since 'let is already defined; and I wrote 'dsb, for "destructuring bind".)
Also... with PLT's "Pretty Big" language, in the Languages menu (click Show Details), you can uncheck a box called "Disallow redefinition of initial bindings", and then you are free to redefine 'let and 'if and so on.
And it would, in fact, be possible to make = reach inside structures. You would just have to tell it how to do so--give it a bunch of patterns, like (= (car x) y) --> (set-car! x y). Hmm, also, it turns out there's a procedure named 'make-set!-transformer, which sounds like something we could use.
But I didn't figure out how to make = work well for assigning to variables that don't exist yet. The general variable assignment procedure in Scheme is set!, as in (set! x 3). But this is an error when x is undefined. To assign to an undefined variable, it seems you must use either namespace-set-variable-value! or define; but 'define can't be used in the middle of an expression (to allow for "block structure", so you can use 'define to make local recursive functions and stuff), and if we simply make (= a b) expand to (namespace-set-variable-value! 'a b), then that won't work when we want to modify a local variable. Here's the best I can do, but I imagine the overhead of the 'with-handlers thing would make it take obscenely long for, say, incrementing the variable i 10000 times... having said that, I tried it out. It takes 33 msec to go (set! x (+ x 1)) 100000 times, and it takes 100 msec to go (= x (+ x 1)) 100000 times. Which is not nearly as bad as I feared--likely any other code your loop executes will take much longer than the difference this makes. I guess this actually isn't a problem. Anyway, my code:
Here, the body of 'thing contains a call to 'expand-thing, which is a function that you just defined. In Scheme, if you define a function and then try to use it in the body of a macro, it won't work--when you use the macro, it will complain of an undefined function. Now, it is possible to get around this: Scheme has a special form, 'begin-for-syntax, which means "execute this code at compile-time". We could, therefore, go
(begin-for-syntax
(define (expand-thing x y) ...))
(define-macro (thing x y)
(expand-thing x y))
That would work. But, now, suppose we had a macro of our own, such as our version of 'if, that we wanted to use in the body of thing or of expand-thing. This does not work:
It will try to expand the call to 'meh and find that 'my-if is an undefined variable. Macros, defined at compile-time, do not get expanded in the bodies of things that are defined at compile-time. (Actually the PLT docs talk about "phases", rather than "compile-time" and "run-time"; it seems you can have arbitrarily many phases of syntax expansion before the final evaluation.) To make this happen, you would sorta have to go (begin-for-syntax (define-macro ...)), but then it doesn't recognize the word 'define-macro. I think it's possible to load mzscheme at earlier syntax phases so it recognizes 'define-macro 2 phases up, but then if you wanted to use a macro in the body of 'my-if, you'd need to go 3 phases up, and so on: it's just too terrible.
My best idea seems to be this: At compile-time, create a table (an assoc-list?) of macro names and their body-functions. Then make your 'mac and 'def and all the other things you want evaluated at compile-time, make them macroexpand their bodies. Do this by checking if it's a list: if so, check if the car is the name of a macro: if so, apply the macro's body-function to the cdr, and repeat; otherwise, leave the list itself as is, but look at its elements (recursively) and macroexpand them. But what if this macro call is inside a quote, or a quasiquote? Oh god. You have to write a whole goddamn code-walker to make this work. I stopped at that point.
Common Lisp works fine on the macros issue. It also agrees with Arc about nil. And SBCL is pretty damn fast. But it has a separate namespace for functions, which is pretty non-foolable with, and it seems rather unwilling to allow redefinition of 'if and 'let and 'map (they call it 'mapcar) and so on. I tend to switch between Common Lisp and Arc these days.
The general variable assignment procedure in Scheme is set!, as in (set! x 3). But this is an error when x is undefined.
I just noticed compile-allow-set!-undefined the other day:
Welcome to MzScheme v4.2.1 [3m], Copyright (c) 2004-2009 PLT Scheme Inc.
> (set! a 3)
set!: cannot set undefined identifier: a
> (compile-allow-set!-undefined #t)
> (set! a 3)
>
fallintothis: I'm aware of the destructuring use of the Arc 'let. But I find that when I want destructuring, I usually want to go (mapcar (fn ((a b)) ...) (tuples 2 xs)), and I rarely use an explicit 'dsb; if I want it, though, it's there. My version of 'with is all I need most of the time.
I am trying out your "arc.lisp" now. Seems great so far.
There is much exploring to do, but my initial confusion with =. Seems it's not working and maybe not even defined. I read your discussion above of the problems with =, but I thought that was about your Arc in PLT, not CL. I haven't looked closely at the guts of "arc.lisp" yet, so maybe that will answer my question.
Oh, I forgot to include that in the list of idiosyncrasies. Basically, use setf wherever you would use = in Arc. The reason, in short, is that = is already the numerical-equality function, and CL doesn't like people to redefine things.
In PLT Scheme, you just have to uncheck a box that says "Disallow redefinition of initial bindings" and then you can redefine almost everything, but in CL, it will complain about breaking a package lock and go into the debugger whenever you try to redefine a built-in function, of which = is one. It's possible to tell it "Yes, ignore package lock", but I don't want to do that for every single function every time I start up SBCL. I think it is possible to tell it to automatically ignore the lock... But this is the way it is right now. Also, when you try to redefine 'if, you just run into a brick wall:
The special operator IF can't be redefined as a macro.
I stumbled on one other workaround, too... you can create a new package and choose not to import (all of) the stuff from CL-USER. So far, though, this doesn't even import the symbol nil, and when I do import it, it gets printed with a "COMMON-LISP:" prefix.
I guess one might be able to get used to that. Might try it out if I feel adventurous. For now, put up with using setf, mapcar, iff, and my version of with.
Gahd, that was a fun read. I'd love to hear about your experiences with SBCL[1]. I've been porting my app over to it for the last 24 hours, lazily taking arc functions/macros as I go. My approach so far is to put up with &rest, extra parens in let, explicit destructuring-bind[2], etc.
[1] Perhaps offline; email in my profile; did you know the email field itself is not public?
[2] I wrote a bind macro variant a few years ago (http://akkartik.name/lisp.html), I may start using that. Still a little more verbose. Hmm, if I eliminate multiple-value-bind I can get rid of :db..
The real issue here is any use of nil. There's a lot of mangling in ac.scm to make sure there are nils using things like ar-nil-terminate.
ssyntax: [], :, .
Actually, [] syntax is already defined by reader macros in brackets.scm.
convenient dot notation for rest args
This is Scheme syntax, too.
> (define (f . args) (car args))
> (f 'a 'b 'c)
a
Scheme doesn't have Arc's (o ...) optional parameters, though.
What can't we do with just macros in PLT scheme?
Do you mean adding things on top of PLT Scheme (e.g., with macros) to turn it into Arc (plus whatever else PLT Scheme defines) so that there's no separate "compilation" step? I.e., Start an mzscheme REPL, punch in these definitions, and then start coding like you're in Arc?
Or do you mean more like "what can Arc do that Scheme can't"?
I assume the former, given your title, but I'm unsure to what end this would be.
For one thing, you'd be failing to separate the two languages. It's already pretty muddled: Arc inherits things like read from Scheme, so
arc> #t
#t
arc> (if #t 'a 'b)
a
arc> (if #f 'a 'b)
b
As mingled as the Arc-side and Scheme-side are now, there are some restrictions; e.g., Arc is vector-agnostic (probably because it uses vectors to implement type-tagging) and only knows about certain Scheme functions, but not others (cf. xdef in ac.scm).
arc> #()
Error: "Bad object in expression #()"
arc> first ; Scheme synonym for car
Error: "reference to undefined identifier: _first"
arc> car
#<procedure:car>
If instead "everything is Scheme", you wouldn't have those sorts of "hey, I'll whine if you use these" restrictions. Or is that the aim?
Other difficulties that come to mind are where Arc features step on Scheme's toes directly. For example, atstrings
arc> (declare 'atstrings t)
t
arc> "@(+ 1 2)"
"3"
or re-kajiggering if to accept multiple clauses.
In the limit, you could probably do everything in PLT Scheme (Turing-complete, yadda yadda), it's just a matter of bending over backwards enough: reader macros for atstrings, (define-syntax if ...) the right way, etc.
Or, y'know, just figure out how to make mzscheme's evaluator pass everything through ac first. Ideas for implementation: tl2 in ac.scm... ;)
"Do you mean adding things on top of PLT Scheme to turn it into arc, so that there's no separate "compilation" step?"
Yes.
"I'm unsure to what end this would be."
The reason I ask: I'm running into performance issues with readwarp. I can't have more than 3 users at a time before I start seeing the occasional timeout. At 5 users timeouts become common.
PLT scheme has a GIL and can only run one thread at a time. I think I'm running into that limitation this quickly because arc is a factor of n slower from the rewriting. Caching isn't an option like for HN because it's literally picking one story at a time to render.
So I may have to move reluctantly to a different scheme, or even to common lisp. I'm wondering how much of my house I can carry on my back to the ghetto :)
Even in the context of PLT scheme it seems a valid question to ask why we need this compilation step. Is it buying us enough to justify itself? So far it seems like what we get is nil-termination, optional args and parallel/destructuring assignment. I think a library of macros without those features would still seem mostly like arc to me, and it would be significantly faster.
You should try and see what subset of the language you actually use, and figure out how hard that would be to implement directly on top of scheme.
I don't really know, but the main thing that I see ac providing is the ability to do arbitrary transformations on the first element of a list before it's passed to scheme. This allows functional position definitions for all kinds of objects, and could allow for lots of different kinds of function dispatch (such as dispatch on second arg, which I think pg was planning)
Also, ac (and arc in general) give us a succinctly defined language whose source fits in only two files of less than 1k lines each. This means that the language is easy to learn about and from, and easy to modify. I don't know how many times I've thought "gee, I wish arc could do something cool, like be able to print the source code for global functions" and just gone to the source and added it.
Mind you, that doesn't preclude macros, and in fact arc already is in some ways defined as a set of macros over scheme. It's just that one of those macros is recursive, and visits all of the elements of the source tree ;)
But if performance is what you're after, defining the minimal palatable subset of arc as macros on scheme may be the way to go. It won't be as nice, but it might be a lot faster.
Why not try it, and see? Arc isn't that big, and at least three implementations have been written already.
"the main thing that I see ac providing is the ability to do arbitrary transformations on the first element of a list.."
Ah yes. Everytime I try to enumerate them I miss one. But this page has the whole list :)
"ac (and arc in general) give us a succinctly defined language whose source fits in only two files of less than 1k lines each."
Oh, I'm extremely sensitive to that, believe me. That's what's awesome about arc, and it's kept me using it full-time for 9 months now. Making it a layer of macros would actually not detract from that IMO.
As I've done more with arc I've had to deal with the underlying platform anyway. So I feel like there's no point even trying to abstract that away too much.
"Why not try it, and see? Arc isn't that big, and at least three implementations have been written already."
Yes. You know, in any other circumstances I would consider it a fun experiment. But since I started working on readwarp full-time I constantly worry that I'm going to spend time and burn cash on what's fun rather than what takes the product forward. I'll prob still end up doing this, but just agonize a lot more before I do.
Well, if speed is your main requirement, and you have an actual app that you want to run faster, I would examine it and determine a) what subset of arc you use and b) what subset of arc you need. If it's most of arc, then rewriting it in scheme might be pretty hard.
Of course, first determine how much speed you actually need. And as aw said, rewriting in a faster language won't necessarily help you scale much anyway. Finding where you need the extra speed is definitely a prerequisite to figuring out the best way to get it.
Ultimately, we can't really tell you what to do, because we don't know nearly as much as you do about the needs you have for your application, the bottlenecks that are occurring, or how much time it would cost to re-implement your system on top of scheme directly.
Lisp is great for bottom up design, but whether you can get all the way up to your application before reaching arc first is not something that we can't really judge from here. Certainly some subset of arc can be built as mere macros over scheme. But only you can determine whether that subset is comfortable enough to use, and fast enough to solve your problems.
I don't understand your comment that caching isn't an option because it's picking one story at a time to render. You have a bunch of feeds containing a bunch of stories. Some other background thread / process / program / box can fetch stories, render them, and store the rendered story to disk. Now all your user interface has to do is choose which story to present to the user.
In most programs that are too slow it's a small percentage of the code which is causing the slowness. If you work on making all of your code faster, you're wasting 95% of your effort, because it's only 5% of the code which is causing a problem. So it's worthwhile to run timing tests and figure out what exactly is the slow part, and work on algorithms such as caching to make those parts fast enough.
If you can't think of any algorithms or methods to handle the parts that are too slow, then you can rewrite your program (either the whole program or just the slow parts) in a faster language. But this will only take you so far. If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users? At some point you need to figure out what your strategy is for handling lots of users, and rewriting in a faster language is only a temporary fix for that issue.
"If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users?"
We're talking about 60 concurrent users. If I could do 60 requests per second I'd be a happy man. 60 requests per second is 216k requests/hr. Assuming 30 page views per happy readwarp user, that's 7200 uniques per hour. Let's say daily uniques at that point is 6 times that, 40k uniques per day. With 40k uniques a day I'd be a happy man. More importantly, 40k uniques will take me n years to get to. So you see, I don't need to "scale beyond 60 users".
Right now readwarp isn't even at 3 requests per second because requests can take several seconds with contention. But that's not ideal either. And it's largely because of the language and the platform.
What you describe in the first paragraph is not caching, it's precomputation. It may help, but it's a completely different performance calculation. The reason caching works for HN is that even if it gets 10k users in 5 minutes it has to generate a dozen pages once that they all see. When I have to manage 100x as many pages in the 'cache' I have to be a lot more careful not to waste time on stories that never get shown.
Precomputation doesn't actually save very much work at all. All it does is move latencies around, often by throwing hardware at the problem. And it can easily go off the rails and end up doing a lot of wasted work without helping latency.
Precomputation does help sometimes. In fact I use it in readwarp in a couple of places. But it's not as straightforward a cost-benefit analysis as caching would be. When caching works it's just brain-dead to reason about. With precomputation I have to constantly monitor whether I'm getting the bang for my buck. And I have to be conservative.
The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it.
Hmm, I probably haven't fully explored all these ideas. Every now and then my server just hangs for a few seconds and becomes completely unresponsive, and I haven't been able to figure out what it's doing at that point, to get a stack trace or something. For any timing instrumentation I insert, times often vary wildly, by an order of magnitude. I'm starting to suspect PLT's timings shoot through the roof because of contention between threads.
I should just try to offload all my state in a way that can handle multiple coexisting arc interpreters. I'll probably do that. Arc seems quite memory-heavy, though. I'm not sure how many interpreters can coexist on a machine.
You're right that I'll still have to scale in another language, but language features can make my life easier or harder when I have to scale to multiple machines. Having a global interpreter lock perhaps makes it hard to avoid thread contention, and to balance responsiveness with throughput. I know the python folks have had similar issues. OTOH, having just one request per interpreter makes arc far more inefficient in RAM than say mongrel.
"The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it."
Actually there's 3 stages. Selecting a story (A), rendering it (B), and updating user data structures when the vote on it returns (C). Each user request is seeing the total time for C+A+B.
When I eliminate rendering (the server does all the computations, but doesn't send the story to the client) I still see timeouts with 3 concurrent requests.
---
Ah, I got an improvement. Remember those 'couple of places' I was doing precomputation? I was trying to batch up stage-A computations 5 at a time in a precompute thread when a user's running low on stories. I got rid of them, and the average latency for everyone went up from 1s to 3-4s even with just one concurrent request at a time. But I can now handle 6 concurrent requests without timing out.
When I eliminate stage C the number starts inching up to 10. At 10 median request time is 13s. But at least stuff is getting processed.
My lessons:
1. When the interpreter is single-threaded try to minimize use of threads. Contention gets expensive fast. I am now used to seeing processing latency double when #concurrent requests doubles.
2. When things are slow resist the temptation to precompute. It always makes hot spots harder to see in the future. It's a last resort. And it'll always cost you on a single-threaded platform. I'm now going to rerun my profiling.
3. When you're stuck come blather about your problems on the arc forum. Thanks guys.
So I can vary the number of assignments each op does in the URL.
Experiment 1: just a hundred sequential requests.
$ ab -n 100 "http://127.0.0.1:8080/?iters=10"
On my laptop 98% of requests complete in 3ms.
Experiment 2: let's inch up to 2 concurrent requests at a time.
$ ab -n 100 -c 2 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require at least 4ms. For 98% it's 10-15ms.
Experiment 3:
ab -n 100 -c 10 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require 20-40ms.
In general the 50% latency number that apacheBench returns seems to increase linearly with the number of concurrent requests. This rule of thumb has held in my app as well over several major changes. Right now readwarp requests take 800ms on localhost and 2-3s from the internet at large. Not much room for those latencies to grow linearly.
I can't compare your first and last example because one says 98% and the other 50%.
If it were an apples to apples comparison would you agree that it wouldn't be surprising that if the server were handling 10x the number of simultaneous requests that each request would take 10x as long?
The only apples to apples comparison is going from one simultaneous request to two simultaneous requests which has the increase from 3ms to 10-15ms, which seems like about double what we'd expect if using threads had no overhead at all.
When I give just the 98% number I meant the 50% number stays the same (+/- 1ms). We're at the limits of ab's time resolution here.
Yes, none of this is surprising given that PLT is single-threaded. I'm trying to remember my class in queuing theory.. If each request takes m ms to process, n concurrent requests will take mn ms to process, (n-1)m ms of queuing delay and m ms of processing time. The average latency seen would be mn/2 which scales by n. As the number of concurrent requests grows, the curve of latency to number of concurrent requests would be linear.
If you now add a second queue, the curve stays flat until 2 requests and then moves linearly up. Each queue is only n/2 long, so average queuing delay is (n-1)m/4 ms. With q queues, on average there's only (n-1)m/2q ms of queuing delay.
Summary: if you can have multiple truly-concurrent threads, you'll be able to overlap that many concurrent requests without any increase in latency, and the slope of the curve past that point will be inversely proportional to the number of threads. Does this make sense?
---
After I wrote my previous comment I started looking in greater depth into why I'm taking 800ms to process a request. I'm really not doing that much. I'll post an update with what I find. Perhaps I haven't spend enough time attacking m before looking to expand q.
When you say that PLT is "single-threaded", is what you're referring to is that PLT only uses one CPU?
Well, yes, if you have 10 CPU's then you can handle 10 simultaneous requests in the same elapsed time as 1 CPU can handle 1 request. If you don't have any overhead in communication between CPU's, locks, etc.
I think it's clearer to talk about having one CPU or multiple CPU's instead of "truly-concurrent threads" or "thread contention".
You can have multiple threads interleaving on a CPU, and in some systems that support more than one CPU, a thread can run for a while on one CPU and then get moved to running on a different, less busy CPU.
Now if you said that that on one CPU 100 requests one at a time took a total of 300ms to process but 100 requests running interleaved on one CPU took a total of 800ms to process, then we would have evidence that there's too much overhead in thread context switching or some other problem.
I think the quality of the thread implementation has a role to play. When a process waits on IO the OS will find something else to do. I don't know if user-level threads can be that smart, and I don't know if PLT threads are that smart.
Without casting aspersions, it's not just about number of processors. Very few programs avoid IO altogether, and if there's IO some programs will be more clever than others in moving work around.
In a properly working thread system, all threads that aren't waiting for something should be given a fair slice of the available CPU time. In particular, no thread that is ready to run should be blocked or delayed because some other thread is waiting on I/O.
If you are wondering if PLT threads work correctly, isn't this rather easy to test? Do an A/B test with some threads that do some computation with and without some other threads waiting on I/O and see if the runtimes are the same.
It seems like things are straightened out here, but I thought I'd point out that the upcoming version of PLT allow multi-core parallelism via 'future':
Also: PLT does not have a GIL, but (futures aside) it has a similar mechanism that works better for implementing continuations (and worse for interoperating with C code)
You can do all of these as macros (etc) rather than a compiler. For example, an `fn' macro would inspect the body and plant `nil' return values; for arbitrary `='s see swindle's `setf!', the convenient indexing is easily done with generalized functions, rest notation and destructuring also exist in plt and would be easy to incorporate. The difficulties lie more in the subtle points: for example, plt macros have phase separation, and arc is more lispish with a single phase for everything -- and doing that would be very difficult at best. As another example, mutable cons pairs are hacked onto arc (when using the newer plt versions) in a way that works out because arc is essentially serializing its own lists -- but if you use more plt-isms, and end up using more consed lists, you are more likely to run into problems with stepping over the plt assumption of pairs being immutable.
I was interested in the use of ? as a convention for predicates [http://arclanguage.org/item?id=2998]. Seems like Arc still doesn't have a consistent convention for that (and isn't ? still unclaimed territory?). On ! as a convention for setters, I use '= as an almost universal setter, making various specialized setters seem a bit unnecessary. For example, in most cases I'd probably do
(= foo.0 x) ; instead of (scar foo x)
Maybe '=car, '=cdr would be better than 'scar, 'scdr. The s-family just doesn't make much sense anymore since 'assign replaced 'set in its traditional role. Same thing for 'safeset. I'd change it to 'safeassign but it's so long. I thought about 'safe= but '= and 'assign aren't really equivalent (which may actually undermine my argument for '=car).
Yeah, if I rewrote arc from scratch I'd def use the ?/! convention more thoroughly. Your other suggestions seem more subjective, though :) I don't want to go overboard with consistency.
I'm thinking about forking arc, perhaps with an implementation in erlang/lfe (erc?) that would give us threading and pattern-matching and erlang's great scheduler for free.
One of the key core features I find myself wanting is generic functions so annotated types can easily overload primitives like prn and len. If I write an implementation of skip lists or doubly linked lists, for example, it's a royal pain to always remember to (do1 nil ..) at the repl because prn hangs in the presence of circular pointers or pointer explosion.
> Your other suggestions seem more subjective, though :)
I know, I'm afraid I'm earning a reputation here. ;) But out of curiosity, which ones were you thinking of? This point I feel strongly about, though I may have incomplete information:
> The s-family just doesn't make much sense anymore since 'assign replaced 'set in its traditional role.
My understanding is that at some point 'set became the new 'assert and 'assign the new 'set [http://arclanguage.org/item?id=9408]. The #\s in 'scar, 'scdr, 'sref and the "set" in 'safeset are presumably references to 'set, which no longer has anything to do with these functions (and macro). I know it's not a big deal, but isn't it an onion and potential source of confusion?
Yeah that was the one. scar/scdr don't have to make more sense than car/cdr. And they're kinda internal details, same as assign. Most of the time we'll just use (= (car ..) ..). That'll do, no?
I guess so, but changing 'set to 'assign would seem to indicate some sensitivity about the internal details on the language designer's part, and I think they are more important in code-as-spec languages like Arc. Granted you're right the naming of 'scar etc. isn't as important as the naming of commonly used forms like '=.
I believe set was renamed to assign to free up the name for a better use.
The way I think about it: code as spec implies the implementation should be concise and easy to read, and externally exposed names should be approximately hamming encoded. Internal names can be longer without compromising conciseness (which is measured in tokens) or readability.
It's not black and white, just relative priorities.