Ideally, the 'results' in the documentation would still be computed dynamically, so you can always see whatever the current output would be, but also have the expected results available for separate regression testing.
All the talk about stdin/stdout/randomness got me realizing that, really, any state is going to be an issue (as state is wont to be). Not that it's often a problem in a mostly-functional language like Arc.
On the one (liberating) hand, Arc's philosophy clearly guides us to not lose sleep over the potential problems of state---just be careful. I see your destructive examples are careful to use ret, for instance, rather than global assignment. On the other hand, we are losing the ability to express certain kinds of tests here. Ideally, we could eval the examples in their own "sandbox" and capture their output, just as if it had been at a REPL.
This is an interesting problem, because it feels so familiar. After some slow, tired thoughts (been a long day for me), it hit me: "Oh! We're doing doctests, except in the other direction."
In Python:
1. Copy/paste a REPL interaction into a docstring.
2. Looking up the documentation shows the static string.
3. A test function (doctest.testmod()) combs through docstrings, parses them apart, runs them in isolated environments, and checks their results (sketched after this list).
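Something like this, say (a minimal sketch: double and its docstring are made up, but doctest.testmod() is the real entry point):

    def double(x):
        """Double a number.

        The lines below are pasted straight from a REPL session; doctest
        re-runs them and compares the printed output to the text that
        follows each >>> prompt.

        >>> double(2)
        4
        >>> double(-3)
        -6
        """
        return x * 2

    if __name__ == "__main__":
        import doctest
        doctest.testmod()  # finds the docstring examples, runs them, reports any mismatches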
Here:
1. You give some literal code as examples (yeah, Lisp!).
2. Looking up the documentation evaluates the example in the current environment.
3. Running the example generates some output that we print out as though it had been pasted from a REPL interaction.
For Point 1, using sexps seems more befitting Arc than parsing out docstrings. Yet we do lose something here: writing literal strings of expected output is cumbersome, as you mentioned. Much better would be to just copy & paste an entire REPL interaction (inputs & outputs) into a multiline string and call it a day.
Point 3 is a small detail: to run the tests, you "have" to look at the docs. Except writing a doctest.testmod() equivalent isn't hard---just evaluate all the examples without printing their docs.
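In Python terms, the equivalent is roughly this (a sketch only: the EXAMPLES registry is a made-up stand-in for wherever the doc system actually keeps its example/expected pairs):

    import io
    import contextlib

    # Hypothetical registry: example source paired with its expected output.
    EXAMPLES = [
        ("print(2 + 2)", "4\n"),
        ("print('hello'.upper())", "HELLO\n"),
    ]

    def testmod():
        """Evaluate every registered example and check its output, docs unprinted."""
        failures = 0
        for source, expected in EXAMPLES:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):
                exec(source, {})            # run the example...
            if buf.getvalue() != expected:  # ...and compare output, nothing else
                failures += 1
                print("FAIL:", source, "expected", repr(expected), "got", repr(buf.getvalue()))
        return failures

    if __name__ == "__main__":
        raise SystemExit(testmod())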
But Point 2 is an interesting one. Arc doesn't have much support for environment-mangling: eval doesn't see lexical variables, there are no modules or namespaces, environment introspection is difficult, macros introduce their own weirdness, etc. About the only way I see to get a guaranteed clean slate (without requiring us to just "be careful" with the examples) is to fork off another Arc interpreter, run the example in that process, and capture the output (including stderr!) for comparison with the docstring. I reckon it would be slow as hell (one interpreter for each block of examples), but is that otherwise feasible?
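Here's roughly what I'm picturing, as a Python sketch (the mzscheme -f as.scm command line is just a placeholder for however you actually launch Arc; the point is: fresh process, example fed on stdin, stdout and stderr captured, string-compared against the docstring):

    import subprocess

    def run_example_isolated(example_source, expected_output,
                             command=("mzscheme", "-f", "as.scm")):
        """Run one example block in a brand-new interpreter process.

        `command` is an assumption -- substitute whatever actually starts an
        Arc REPL.  Returns (passed?, actual output) so the caller can report.
        """
        proc = subprocess.run(
            command,
            input=example_source,   # feed the example as if typed at the REPL
            capture_output=True,    # grab stdout *and* stderr
            text=True,
            timeout=60,
        )
        actual = proc.stdout + proc.stderr
        return actual == expected_output, actual

In practice you'd probably also have to strip REPL banners and prompts out of the captured output before comparing, but at least the clean slate comes with the fresh process.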
I like the idea of comparing output strings universally (the repl-output comment I made), so that we don't have to decide which expected results to selectively evaluate. Instead, just compare the strings: (is "#hash(...)" "#hash(...)"). But there are probably ways that comparing output strings could break: obvious things like random number generation, race conditions, and user input. But even tables: are two tables with the same contents guaranteed to serialize the same way? I'm too tired to think that through or look up an answer.
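I don't know offhand what the underlying Scheme guarantees about hash-table printing, but a Python analogy shows the kind of thing I'm worried about: two tables with equal contents can still print differently.

    # Python 3.7+: dicts remember insertion order, so equal dicts can print differently.
    a = {"x": 1, "y": 2}
    b = {"y": 2, "x": 1}

    print(a == b)              # True  -- same contents
    print(repr(a))             # {'x': 1, 'y': 2}
    print(repr(b))             # {'y': 2, 'x': 1}
    print(repr(a) == repr(b))  # False -- a string comparison would flag this as a failure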