The benefit of interleaved tests is future-proofing, like you said. When making large, non-localized changes such as the move to a new arc, you know exactly the first test that failed. But what happens when you scale from one file to two? When multiple files of definitions depend on arc.ss?
The benefit of keeping tests in a distinct file: you can reason about the final semantics of definitions without thinking about the order in which they are orchestrated. That is a useful simplification except when making large localized changes.
I'm not certain of these ideas by any means. Just thinking out loud. Would it be useful to analyze the tree of dependencies, and to execute tests in dependency order regardless of the order they're written in? Would that give us the best of both worlds?
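To make the dependency-order idea concrete, here is a rough sketch. It's in Python rather than Arc, purely to keep it short, and everything in it is made up for illustration: the `deps` table, the `tests` table, and the function name. It assumes we can somehow extract which definitions each definition depends on, which is the hard part and isn't shown. It relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

def first_failure_in_dependency_order(deps, tests):
    """Run tests so that a definition's tests run only after the tests
    of everything it depends on. Return the name of the first
    definition whose test fails, or None if all pass.

    deps:  name -> set of names it depends on (hypothetical input)
    tests: name -> zero-argument callable returning True on success
    """
    for name in TopologicalSorter(deps).static_order():
        test = tests.get(name)  # definitions without tests are skipped
        if test is not None and not test():
            return name
    return None

# Hypothetical example: 'filter' is built on 'map', which is built on
# 'car' and 'cdr'. Both 'map' and 'filter' are broken.
deps = {
    "filter": {"map"},
    "map": {"car", "cdr"},
    "car": set(),
    "cdr": set(),
}
tests = {
    "filter": lambda: False,
    "map": lambda: False,
    "car": lambda: True,
    "cdr": lambda: True,
}

first = first_failure_in_dependency_order(deps, tests)
```

Here `first` comes out as `"map"` even though `filter` is listed first in both tables, so the report points at the lowest-level breakage no matter how the tests are laid out across files. Whether that is worth the cost of computing the dependencies is still an open question to me.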