Arc Forumnew | comments | leaders | submitlogin
Regular expressions in Arc (
4 points by dido 3498 days ago | 16 comments

4 points by shader 3498 days ago | link

One thing I've been thinking about recently, and might be handy for implementing regex or other features would be to allow macros access to the string contents of their original invocation. I've been in favor of meta-code features like that in order to enable help functionality and exploratory programming features, like 'src on anarki, but it might help for allowing users to write their own "reader macros" because, within the scope of the call, any macro can become a reader macro. Then, as long as you have decent string manipulation and parser library, it would be easy to just add a regex library that allows you to call (re /a?/), and add custom regex syntax that way. That or just make a standard way to add special characters to the read macro list like the #* set in racket.

How are you all currently supporting read macros, and what is your opinion on the best way to do so?


2 points by akkartik 3498 days ago | link

Pauan has a Pratt parser baked into his language ( while my attitude for wart has been "if you want to change the syntax, hack the parser". Not to be flippant; my goal explicitly is to make the code dead simple for anybody* to hack on. Not there yet, but I'd love for you to take a stab at your regex idea with it :) Perhaps we could pair on it sometime?

* who knows C. Other terms and conditions may apply.


2 points by Pauan 3498 days ago | link

I would like to point out that although it's a Pratt parser, it's been specifically modified to work better with Lisps, meaning it operates on lists of symbols rather than on a single expression. I have not seen another parser like it.

This makes it much more powerful while also being much easier to use. Using the Nulan parser, adding in new syntax is as easy as writing a macro!

For instance, in Nulan, the "->" syntax is used for functions:

  foo 1 2 3 -> a b c
    a + b + c
The above is equivalent to this Arc code:

  (foo 1 2 3 (fn (a b c)
    (+ a b c)))
And here's how you can implement the "->" syntax in Nulan:

  $syntax-rule "->" [
    priority 10
    order "right"
    parse -> l s {@args body}
      ',@l (s args body)
As you can see, it's very short, though it might seem cryptic if you don't understand Nulan. Translating it into Arc, it might look like this:

  (syntax-rule "->"
    priority 10
    order "right"
    parse (fn (l s r)
            (with (args (cut r 0 -1)
                   body (last r))
              `(,@l (,s ,args ,body)))))
The way that it works is, the parser starts with a flat list of symbols. It then traverses the list looking for symbols that have a "parse" function.

It then calls the "parse" function with three arguments: everything to the left of the symbol, the symbol, and everything to the right of the symbol. It then continues parsing with the list that the function returns.

So basically the parser is all about manipulating a list of symbols, which is uh... pretty much exactly what macros do.

Going back to the first example, the "parse" function for the "->" syntax would be called with these three arguments:

  1  (foo 1 2 3)
  2  ->
  3  (a b c (a + b + c))
It then destructures the 3rd argument so that everything but the last element is in the variable "args", and the last element is in the variable "body":

  args  (a b c)
  body  (a + b + c)
It then generates the list using "quote", which is then returned.

Basically, it transforms this:

  foo 1 2 3 -> a b c (a + b + c)
Into this:

  foo 1 2 3 (-> (a b c) (a + b + c))
As another example, this implements Arc's ":" ssyntax, but at the parser level:

  $syntax-rule ":" [
    priority 100
    order "right"
    delimiter %t
    parse -> l _ r
      ',@l r
So now this code here:

  foo:bar:qux 1 2 3
Will get transformed into this code here:

  foo (bar (qux 1 2 3))
I've never seen another syntax system that's as easy and as powerful as this.

Oh yeah, and there's also two convenience macros:

  $syntax-unary "foo" 20
  $syntax-infix "bar" 10
The above defines "foo" to be a unary operator with priority 20, and "bar" to be an infix operator with priority 10.

Which basically means that...

  1 2 foo 3 4  =>  1 2 (foo 3) 4
  1 2 bar 3 4  =>  1 (bar 2 3) 4
Here's a more in-depth explanation of the parser:

Nulan's parser is powerful enough that almost all of Nulan's syntax can be written in Nulan itself. The only thing that can't be is significant whitespace.

Even the string syntax (using "), whitespace ( ), and the various braces ({[]}) can be changed from within Nulan.


1 point by shader 3496 days ago | link

I'm actually fairly convinced that my original idea as stated above wouldn't work, because without specific syntactical support already built in for whatever new reader you're trying to add, the original reader would have no good way of knowing when your macro ended, i.e.:

  (re /\)*/)
How does the original reader know to pass all of "/\)*/" in to the 're macro? Maybe there's some clever way to tell the difference between "escaped" special characters and normal ones, but it would limit the possibilities on the temp reader.

Maybe one option would be to have more flexibility when defining macros in the first place by taking advantage of the fact that macros can be defined and evaluated at the reader stage, so they can specify their own read semantics for their evaluation if they choose. I.e. make it so that macro definitions can specify a more generic level of control than just unevaluated, pre-parsed s-exps. That would make them more of scoped 'reading macros' that shadow the existing reader, rather than 'reader macros' that just hook into it.


3 points by rocketnia 3495 days ago | link

Your train of thought is very similar to a factor in several of my designs over time: Jisp[1], Blade[2], Penknife[3], Chops[4], and now the syntax I'm planning to use with Era[5].

I've always used use bracket nesting to determine where the syntax's body begins and ends. This way, most code requires no escape sequences, and the few times escape sequences are necessary, at least they stand a chance of being consistent when cutting-and-pasting across staging levels.

  (re /\)*/)       ; Broken.
  (re /(?#()\)*/)  ; Fixed with a regex comment. (Does this work?)
  (re /\>*/)       ; Fixed by redesigning the escape sequence.
Reader macros are a different approach, where the entire rest of the file/stream has unknown syntax until the reader reaches that point. That's simple in its own way, but I prefer not to make my code that lopsided in its semantics I guess. :)

(EDIT: Half an hour after I posted this, I realized I had a big mess of previous drafts tacked onto the end. I've deleted those now.)


[1] Jisp was one of the first toy languages I made, and it was before I programmed in Arc (or any other lisp). When it encounters (foo a b c), it resolves "foo" to an operator and sends "a b c" to that operator.

  > (if (eq "string (doublequote string)) (exit) 'whoops)
  [The REPL terminates.]
[2] Blade didn't get far off the ground, but it was meant to have a similar parser, with the explicit goal of making it easy to combine several languages into a single compiled program, with an extra emphasis on having no accidental code-order-dependent semantics at the top level. I switched to square brackets-- [foo a b c] --since these didn't require the shift key.

      import com.rocketnia.blade.*
      define( [ "out", "sample-var" ], BladeString.of( "sample-val" ) )
[3] Penknife was meant to be a REPL companion to Blade, and I developed the syntax into a more complicated, Arc-like combination of infix and prefix syntaxes ( Penknife was complete enough that I used it as a static site generator for a while. However, at this point I realized all the custom syntax processing I was doing was really killing the compile-time performance.

  arc.prn.q[Defining fn-iferr.]
  [fun fn-iferr [body catch]
    [defuse body [hf1:if-maybe err
  arc.prn.q[Defining iferr.]
  [mac iferr [var body catch]
    qq.[fn-iferr tf0.\,body [tf [\,var] \,catch]]]
[4] Chops is a JavaScript library that achieves Blade-like parsing without any goals for infix treatment or general-purpose programming. I use it as a markup language and a JavaScript preprocessor now that my static site generator runs in the browser.

  $rc.rcPage( "/", $cg.parseIn( [str RocketN[i I]] ),
      "19-Nov-2012", $cg.parseIn( [str 2005[en]2010, 2012] ),
      { "title": ", Virtual Index of Ross Angle",
          "breadcrumbs": $cg.parseIn(
              [str RocketN[i I] Virtual Index of Ross Angle] ) },
      $cg.parse( [str
  ((This is the open source version of my site.
  [out The online version] has a bit more
      ] ) )
[5] Era is a module system I'm making, and I intend to compile to those modules from a lisp-like language. I've switched back to parentheses-- (foo a b c) --because smartphone keyboards tend to omit square brackets. The code (foo a b c) parses as a four-element list of symbols in the hope of more efficient processing, but the code foo( a b c) parses as a single symbol named "foo( a b c)".


1 point by akkartik 3495 days ago | link

The bouncing between parens and square brackets is interesting ^_^ I weakly feel it's not worth optimizing for what's easy to type because things can change (or be changed, with keybindings, etc.) so easily. Better to optimize for how things look. But even there, parens vs brackets is in the eye of the beholder.


1 point by akkartik 3495 days ago | link

"I've always used use bracket nesting to determine where the syntax's body begins and ends."

I have no idea what you mean by bracket nesting, or by staging levels. It also wasn't clear what the escape sequence is in the third example.


3 points by rocketnia 3495 days ago | link

"bracket nesting"

Whoops, I can't believe I didn't use the phrase "balanced brackets" instead. ^_^

The following two pieces of text may be similar, but I'd give them significantly different behavior as code:

  (foo a b (bar c d) (baz e) f)
  (foo a b bar c d) (baz e f)
My systems don't provide any way to pass the string "a b bar c d) (baz e f" to an operator.


"staging levels"

Staged programming is where a program generates some code to run later on, perhaps as a second program in some sense--especially if that boundary is enforced by a need to serialize, transmit, or sandbox the second program rather than executing it here and now. Staged programming has some implications for syntax, since it's valuable to be able to see the code we're generating.

Most languages use " to denote the beginning and end of a string, so they can't also use " to represent the character " inside the string. This can makes it frustrating to nest code within code. I'll use a JavaScript example.

  > eval( "eval( \"1 + 2\" )" )
  > eval( "eval( \"eval( \\\"eval( \\\\\\\"1 + 2\\\\\\\" )\\\" )\" )" )
While all these stages are JavaScript code, they all effectively use different syntax. It's not easy to copy and paste code from one stage to another.

Suppose we identify the end of the string by looking for a matching bracket, possibly with other pairs of matched brackets in between. I'll use ~< and >~ as example string brackets.

  > eval( ~<eval( ~<1 + 2>~ )>~ )
  > eval( ~<eval( ~<eval( ~<eval( ~<1 + 2>~ )>~ )>~ )>~ )
This fixes the issue. The same code is used at every level.

In JavaScript, technically we can implement delimiters like these if we're open-minded about what a delimiter looks like. We just need a function str() that turns a first-class string into a string literal.

  > str( "abc" )

   Open string:  " + str( "
  Close string:  " ) + "

  > eval( "eval( " + str( "1 + 2" ) + " )" )
  > eval( "eval( " + str( "eval( " + str( "eval( " + str( "1 + 2" ) + " )" ) + " )" ) + " )" )
Now the code is consistent! Consistently infuriating. :-p

In Arc, we delimit code using balanced ( ). The code isn't a string this time, but the use of balanced delimiters has the same advantage.

  > (eval '(eval '(eval '(eval '(+ 1 2)))))
This advantage is crucial in Arc, because any code that uses macros already runs across this issue. A macro call takes an s-expression, which contains a macro call, which takes an s-expression....

Since we're now talking about macros that take strings as input, let's see what happens if Arc syntax is based on strings instead of lists.

  > (let total 0 (each x (list 1 2 3) (++ total x)) total)

  > "let total 0 \"each x \\\"list 1 2 3\\\" \\\"++ total x\\\"\" total"
If we use balanced ( ) to delimit strings, we're back where we started, at least as long as we don't look behind the curtain.

  > (let total 0 (each x (list 1 2 3) (++ total x)) total)
If you want working code for a language like this, look no further than Penknife. :)


"It also wasn't clear what the escape sequence is in the third example."

Are you talking about this one?

  (re /\>*/)
The original code would be broken in my approach because it uses ) in the middle of the regex, so the macro's input would stop at "/\". This fix addresses the issue by using a hypothetical escape sequence \> to match a right parenthesis, rather than using the standard escape sequence \).

If you're talking about my Penknife code sample, the "qq." part is quasiquote, and the \, part is unquote. Quasiquotation is relatively complicated here due to the fact that it generates soup, which is like a string with little pieces floating in it. :-p Penknife has no s-expressions, so it was either this foundational kludge or the hard-to-read use of manual AST constructors.

It's hard to count these examples with a whole number. XD Let me know if you were talking about my Blade code sample (the third code block in the post) or my Jisp code sample (third if you count the two example regex fixes separately).


1 point by akkartik 3495 days ago | link

Super useful, thanks. Yes, you correctly picked the code I was referring to :)

That issue with backslashes you're referring to, I've heard it called leaning toothpick syndrome.


2 points by rocketnia 3495 days ago | link

"Yes, you correctly picked the code I was referring to :) "

Which of my guesses was correct? :-p


"That issue with backslashes you're referring to, I've heard it called leaning toothpick syndrome."

Nice, I hadn't heard of that!

It might be worth pointing out that the LTS appearing in my examples is more pronounced than it needs to be. The usual escape sequence \\ for backslashes creates ridiculous results like \\\\\\\". If we use \- to escape \ we get the more reasonable \--" instead, and then we can see the nonuniform nesting problem without that other distraction:

  > eval( "eval( \"1 + 2\" )" )
  > eval( "eval( \"eval( \-"eval( \--"1 + 2\--" )\-" )\" )" )
Here's another time I talked about this:


2 points by dido 3496 days ago | link

I ran into a similar issue with regex syntax when attempting to incorporate it into Arcueid's reader. There seems to be no easy way to parse a regular expression using Perl-like /.../ syntax, not if you also want symbols that use /'s for other things, e.g. the division function. Arcueid thus uses for now, r/.../ for regular expressions, and that syntax could be more easily distinguished from other legitimate uses of symbols with a minimum of fuss.


1 point by akkartik 3495 days ago | link

Wart's tokenizer already knows about backslashes inside strings, so "\"" becomes one token. It seems plausible to try tokenizing in a backslash-aware way everywhere and not just inside strings. Other than that you would have to treat slashes as a delimiter like double-quotes.

It might be an ugly design, but I think it would work, and it would be worth trying out.


3 points by akkartik 3498 days ago | link

Kind of a unique solution for arc, in my experience:


3 points by dido 3498 days ago | link

Interesting, although I don't know if it can be done with efficiency in the worst case. Backtracking regexes can take in their worst-case exponential time for matching, as illustrated in the regex in my original post.


2 points by Pauan 3498 days ago | link

I've used that library, and I think it's really awesome. It feels very Arc-like while still supporting regexp stuff.


1 point by dido 3498 days ago | link

I've not yet come across a Lisp dialect that incorporates regexes the way Perl and Ruby do, so I wonder what's a clean way of using them that respects the idioms of Arc. Then there's the matter of taint mode... How would a taint mode in Arc be like?