Arc Forum | Regular expressions

14 points by lojic 6326 days ago | 19 comments

Sooner or later, I'm going to need regular expressions. Edi Weitz' CL-PPCRE has a fair amount of respect amongst Common Lisp folks. What do you think of the idea of porting it to Arc?

I've heard it's well written and efficient. I think I'm spoiled by Ruby's regex API, so CL-PPCRE's API doesn't feel that nice to me, but if we're not going to have syntax support for regexes, it may be a reasonable choice.

http://weitz.de/cl-ppcre/

"It comes with a BSD-style license so you can basically do with it whatever you want."

4 points by almkglor 6326 days ago | link

Currently arc-wiki rides regular expressions on top of mzscheme's.

Porting PPCRE should be an interesting project in itself.

-----

1 point by lojic 6326 days ago | link

Here's the file list with line counts to give you an idea of the scope w/o having to download the tarball.

  $ find . -name \*.lisp | xargs wc
  1264   5548  62182 ./api.lisp
   579   2692  27510 ./closures.lisp
   811   3294  40252 ./convert.lisp
    84    425   3686 ./errors.lisp
   736   3030  32096 ./lexer.lisp
    56    288   2362 ./lispworks-defsystem.lisp
    67    324   3100 ./load.lisp
   579   2219  25300 ./optimize.lisp
   106    333   3736 ./packages.lisp
   323   1397  15593 ./parser.lisp
   268   1070  13784 ./ppcre-tests.lisp
   807   2950  31225 ./regex-class.lisp
   844   4049  42124 ./repetition-closures.lisp
   507   2515  26187 ./scanner.lisp
   147    660   5347 ./specials.lisp
   299   1547  12843 ./util.lisp
  7477  32341 347327 total

-----

2 points by treef 6326 days ago | link

It also looks like it's OO using CLOS, and it looks very un-Arcish, which raises one of the very hard problems in CS - naming things - or in this case renaming things to be shorter.

-----

1 point by sjs 6325 days ago | link

This SRE (Scheme Reg. Ex.) proposal looks interesting. It's dated '98 and I didn't find any info about an implementation, so perhaps it's not interesting in practice, or maybe no one has done it yet. It's a superset of regular expressions with an s-expr syntax that allows for neat things, such as (- alpha ("aeiouAEIOU")) to match any letter that is not a vowel.

http://www.scsh.net/docu/post/sre.html

[Unfortunately I can't get to the site right now but the link worked recently so hopefully scsh.net is just down temporarily.]
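
For comparison, the set-difference form (- alpha ("aeiouAEIOU")) reads as "alphabetic characters minus the listed vowels". A rough Ruby analogue, offered only as a sketch, uses the && intersection operator inside a character class. The names NON_VOWEL_LETTER and non_vowel_letters are made up for this illustration and are not part of SRE or any library.

```ruby
# Hypothetical names for illustration only.
# Inside a character class, && intersects two sets, so
# "alphabetic AND NOT a vowel" is the same set as "alpha minus vowels".
NON_VOWEL_LETTER = /[[:alpha:]&&[^aeiouAEIOU]]/

# Return every non-vowel letter in text, in order of appearance.
def non_vowel_letters(text)
  text.scan(NON_VOWEL_LETTER)
end

p non_vowel_letters("Regex 101")  # => ["R", "g", "x"]
```

The s-expr form names the two sets and the operation directly, while the string form has to express the difference as an intersection with a negated class.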

-----

1 point by nex3 6325 days ago | link

Yeah, I saw that, too. I agree it looks cool, but I'm a little skeptical because it's so much less concise than standard regular expressions. Maybe this is really a good thing, I dunno.

-----

1 point by beyert 6295 days ago | link

The scsh sres really are wonderful, much easier to use than regular regular expressions, (especially compared to xemacs which I constantly make mistakes with) and you don't have to deal with complex string quoting to use them.

If I were Paul, I would design a similar API for regular expressions.

-----

2 points by map 6325 days ago | link

I'm for adding regular expressions and other string-handling functions. String manipulation is one of the most common tasks in my programs. Are mzscheme's regular expressions good enough?

-----

2 points by map 6325 days ago | link

Can this tiny Ruby program that I recently wrote for my own use be easily converted to Arc? If not, what additions to the language would make it easy to convert this program? (Please keep the string containing the dates unchanged; don't pamper the language.)

  require 'date'

  "
  2008.2.18
  2008--3--21
    2008/5/26
  2008_7_4

  ".scan( /\S+/ ){|date_str|
    date_ary = date_str.split(/\D+/).map{|s| s.to_i}
    the_date = Date.new( *date_ary )
    diff = the_date - Date.today
    if (0..7).include?( diff )
      puts "***"
      puts "*****"
      puts "*******"
      puts "\a  #{ the_date } is a holiday!"
      puts "*******"
      puts "*****"
      puts "***"
    end
  }

-----

4 points by nex3 6325 days ago | link

This is largely a library issue - Arc could certainly do with better support for string ops, regexen, and especially datetime manipulation. But here's what I came up with (note that date-days is pretty inaccurate most of the time):

  (def compact (seq) (keep ~empty seq))

  (def date-days (date)
    (+ (* 365 (date 'year))
       (* 30  (date 'month))
       (date 'day)))

  (def date- (d1 d2)
    (- (date-days d1) (date-days d2)))

  (def format-date (date)
    (string (pad (date 'year) 4 #\0) "-"
            (pad (date 'month) 2 #\0) "-"
            (pad (date 'day) 2 #\0)))

  (= str "
  2008.2.18
  2008--3--21
    2008/5/26
  2008_7_4
  
  ")

  (each date (map (fn (date)
                    (map [coerce _ 'int]
                         (compact:ssplit date [no (<= #\0 _ #\9)])))
                  (compact:ssplit str))
    (let the-date (obj year (date 0) month (date 1) day (date 2))
      (when (<= 0 (date- the-date (datetbl (seconds))) 7)
        (prn "***")
        (prn "*****")
        (prn "*******")
        (prn "\a  " (format-date the-date) " is a holiday!")
        (prn "*******")
        (prn "*****")
        (prn "***"))))
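
To make the "pretty inaccurate" caveat concrete, here is the same 365/30 arithmetic as date-days transliterated into Ruby and compared against the exact difference from the standard date library. This is only a sketch; approx_days and approx_diff are hypothetical names chosen for the comparison.

```ruby
require 'date'

# Same arithmetic as the Arc date-days above: every year counts as
# 365 days and every month as 30 days. Hypothetical helper names.
def approx_days(date)
  365 * date.year + 30 * date.month + date.day
end

def approx_diff(d1, d2)
  approx_days(d1) - approx_days(d2)
end

earlier = Date.new(2008, 2, 28)
later   = Date.new(2008, 3, 1)

puts approx_diff(later, earlier)   # 3 (approximation)
puts((later - earlier).to_i)       # 2 (exact: 2008 is a leap year)
```

The error stays small near month boundaries like this one, but it is enough to shift which dates fall inside a 0..7 day window.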

-----

4 points by map 6325 days ago | link

Good work.

I think you've highlighted at least one gap in Arc's arsenal. Using Arc2:

  arc> (ssplit " foo bar ")
  Error: "reference to undefined identifier: _ssplit"

You needed to use ssplit, but Arc doesn't have it.

I don't think the importance of string ops should be underestimated. String ops are just as essential as numerical ops. A language that cannot effortlessly manipulate strings is a low-level language in my book. If people are supposed to be testing Arc by using it instead of the languages they were using, Arc needs string ops. Can't they be easily lifted from mzscheme?

Remember the thread on implementing Eliza in Lisp and Ruby? No one posted an Arc version.

-----

3 points by nex3 6325 days ago | link

Oh, sorry, I should have specified: I used Anarki-specific stuff in several places. Mostly the date-manipulation, but also ssplit (I actually hadn't realized that wasn't in arc2. Yikes). Using Anarki, it should work, though.

I totally agree that string ops are important. If I recall, PG has also said something to this effect, so I wouldn't be surprised if more of them crop up in the next few releases.

-----

3 points by bogomipz 6325 days ago | link

So Ruby has a syntax for regular expressions, such as /\D+/. What I've always wondered is, does this have any advantage at all?

I mean, the actual regex operations are done by methods on the string class, which like nex3 mentioned is at the library level.

Is there any reason

  a_string.split(/\D+/)

is better than

  a_string.split("\D+")

Please do enlighten me.

-----

5 points by nex3 6325 days ago | link

A distinction between regexen and strings is actually very handy. I've done a fair bit of coding in Ruby, where this distinction is present, and a fair bit in Emacs Lisp, where it's not.

There are two places where it really matters. First, if regexen are strings, then you have to double-escape everything. /\.foo/ becomes "\\.foo". /"([^"]|\\"|\\\\)+"/ becomes "\"([^\"]|\\\\\"|\\\\\\\\)+\"". Which is preferable?

Second, it's very often useful to treat strings as auto-escaped regexps. For instance,

  a_string.split("\D+")

is actually valid Ruby. It's equivalent to

  a_string.split("D+")

because D isn't an escape char, so the call splits the string on the literal text "D+". For example:

  "BAD++".split("D+") #=> ["BA", "+"]

Now, I'm not convinced that regexen are necessary for nearly as many string operations as they're typically used for. But I think no matter how powerful a standard string library a language has, they'll still be useful sometimes, and then it's a great boon to have literal syntax for them.
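
Both points can be checked directly in Ruby. The snippet below is a small sketch; the constant names LITERAL and FROM_STRING are made up for the illustration.

```ruby
# A regex literal and the same pattern built from a double-quoted string.
LITERAL     = /\.foo/
FROM_STRING = Regexp.new("\\.foo")   # the backslash must be doubled here

p LITERAL == FROM_STRING        # => true (same source, same options)

# In a double-quoted string, \D is not a known escape, so the
# backslash is simply dropped and the string is just "D+".
p "\D+" == "D+"                 # => true

p "2008--3--21".split(/\D+/)    # => ["2008", "3", "21"]  (runs of non-digits)
p "BAD++".split("\D+")          # => ["BA", "+"]          (literal text "D+")
```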

-----

3 points by bogomipz 6325 days ago | link

Ok, so what it comes down to, is that you don't want escapes to be processed. Wouldn't providing a non-escapable string be far more general, then?

Since '\D+' clashes with quote, maybe /\D+/ is a good choice for the non-escapable string syntax. Only problem is that using it in other places might trigger some reactions as the slashes make everybody think of it as "regex syntax".
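
For what it's worth, Ruby's single-quoted strings already come close to a non-escapable string: only \\ and \' are processed inside them. A minimal sketch of using one as a regex source follows; RAW_PATTERN and split_on are hypothetical names for the example.

```ruby
# Single-quoted strings keep the backslash in '\D+' as written;
# only \\ and \' are treated as escapes. Hypothetical names below.
RAW_PATTERN = '\D+'

# Compile the pattern text into a regex and split the input on it.
def split_on(pattern_text, input)
  input.split(Regexp.new(pattern_text))
end

p RAW_PATTERN.length                  # => 3 (backslash, D, plus)
p split_on(RAW_PATTERN, "2008/5/26")  # => ["2008", "5", "26"]
```

This removes the double-escaping problem, but the value is still an ordinary string, so functions cannot tell whether it was meant as a pattern or as literal text.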

-----

3 points by nex3 6325 days ago | link

Escaping isn't the only thing. Duck typing is also a good reason to differentiate regular expressions and strings. foo.gsub("()", "nil") is distinct from foo.gsub(/()/, "nil"), and both are useful enough to make both usable. There are lots of similar issues - for instance, it would be very useful to make (/foo/ str) return some sort of match data, but that wouldn't be possible if regexps and strings were the same type.
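
The gsub difference is easy to see in Ruby itself. The sketch below uses a hypothetical wrapper, replace_all, so that the only thing that varies between the two calls is the type of the pattern.

```ruby
# Hypothetical wrapper: the pattern's type decides how gsub treats it.
def replace_all(text, pattern)
  text.gsub(pattern, "nil")
end

# A String pattern matches the literal text "()".
p replace_all("f()", "()")   # => "fnil"

# A Regexp pattern /()/ is an empty group, which matches the empty
# string at every position, including the end of the input.
p replace_all("f()", /()/)   # => "nilfnil(nil)nil"

# A distinct Regexp type can also hand back match data.
m = /foo/.match("a foo b")
p m[0]                       # => "foo"
p m.pre_match                # => "a "
```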

-----

4 points by bogomipz 6324 days ago | link

Now we're getting somewhere :) For this argument to really convince me, though, Arc needs better support for user-defined types. It should be possible to write special cases of existing functions without touching the core definition. Some core functions use case forms or similar to treat data types differently. Extending those is not really supported. PG has said a couple of times:

"We believe Lisp should let you define new types that are treated just like the built-in types-- just as it lets you define new functions that are treated just like the built-in functions."

Using annotate and rep doesn't feel "just like built-in types" quite yet.

-----

2 points by almkglor 6324 days ago | link

Try 'redef on nex3's arc-wiki.git. You might also be interested in my settable-fn.arc and nex3's take on it (settable-fn2.arc).

-----

3 points by earthboundkid 6324 days ago | link

You could always do it the Python way: r"\D+" => '\\D+'

There's also u"" for Unicode strings (in Python <3.0) and b"" for byte strings (in Python >=2.6).

-----

2 points by map 6325 days ago | link

If the "x" modifier is used, whitespace and comments in the regex are ignored.

  re =
  %r{
      # year
      (\d {4})
      # separator is one or more non-digits
      \D+
      # month
      (\d\d)
      # separator is one or more non-digits
      \D+
      # day
      (\d\d)
  }x

  p "the 1st date, 1984-08-08, was ignored".match(re).captures

  --->["1984", "08", "08"]

-----