Arc Forum

3 points by lojic 6499 days ago | link | parent

You'd be surprised how handy /pattern/, =~, !~ can be :)

  module RegexPatterns
    MONEY = /^(\$[ ]*)?([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]{0,2})?$/
    ...
  end

  def valid_currency? str
    str =~ RegexPatterns::MONEY
  end

  def parse_money str
    if str =~ RegexPatterns::MONEY
      str.delete('$, ').to_f
    else
      raise "invalid input"
    end
  end

Not sure how you'd handle group extraction nicely without resorting to implicit variable assignments though.

  raise 'data error' unless data =~ /(\d+) items/
  num_items = $1.to_i
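
One possible way to avoid the implicit `$1` is to keep the MatchData object that `String#match` returns. This is only a minimal sketch for the same "N items" input; the method name `count_items` and the group name `count` are made up for illustration.

```ruby
# Sketch: extract a group without relying on the $1 global.
# count_items and the group name "count" are hypothetical names.
def count_items(data)
  m = data.match(/(?<count>\d+) items/)  # MatchData, or nil when nothing matches
  raise 'data error' unless m
  m[:count].to_i                         # m[1] would work as well
end

puts count_items('order has 12 items')   # prints 12
```

The match result lives in an ordinary local variable, so nothing depends on which regex ran most recently.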

I almost rejected Ruby when an initial perusal of a Ruby text showed some Perlisms, but I have to admit that the regular expression handling of Ruby is a joy to use. If you can make regular expression usage a joy in Arc, that would be awesome. I posted on this thread because I'm not sure if that can be done outside of the core with libraries alone.

8 points by shiro 6499 days ago | link

I incorporated a regexp literal into Gauche Scheme and found it very handy. A regexp literal is written as #/pattern/. When it appears in the procedure position, it also works as a matcher.

    (#/\d+/ "abc123")  => #<match object>

The matcher returns a match object if the given string matches the pattern; you can extract submatches from it. The match object also works like a procedure.

    (cond [(#/(\d+)-(\d+)/ "123-456") => (cut map <> '(1 2))])    
        => ("123" "456")

The good thing I found about this "acts like a procedure" feature is that I can pass it around wherever a procedure is expected. For example, grep can be expressed in terms of the standard 'filter' procedure.

    (filter #/\w+/ list-of-strings)

(I'm not sure Arc can go in this direction, since Arc's operators that take predicates (e.g. 'find', 'some', 'all', ...) do "auto-testification"---if a given object isn't a procedure, they create a predicate that tests equality to the given object---which may conflict with this type of extended use of "callable objects".)
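
For comparison in Ruby terms (not Gauche or Arc), a regexp is not directly callable, but `Enumerable#grep` selects elements with `===`, so a regexp and a plain value can be handed to the same operator. A rough sketch with made-up sample data:

```ruby
words = ['abc123', 'hello', '42', '']

# Regexp#=== performs a match, so grep behaves much like the filter example.
with_digits = words.grep(/\d+/)

# For a plain string, === is equality, similar in spirit to auto-testification.
exact = words.grep('hello')

# A regexp can also be turned into an ordinary predicate object.
has_digits = /\d+/.method(:match?)
same = words.select(&has_digits)

p with_digits  # ["abc123", "42"]
p exact        # ["hello"]
```

In this sketch the choice between "match" and "equality" is made by `===` on the pattern object rather than by checking whether the argument is a procedure.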

-----

1 point by partdavid 6495 days ago | link

There's nothing joyful about regular expressions. For one thing, as above, it leaks logic all over your code. Secondly, it's unclear--it would be quite difficult to identify a bug in your expression. A related clarity problem is that you have restricted inputs to a subset of valid inputs, and it's hard to see how or why. Third, they are brittle and hard to instrument for diagnostics.

-----

1 point by lojic 6494 days ago | link

I don't see how it "leaks logic all over your code". But I like to keep an open mind - what are you suggesting as an alternative for the above example?

Regarding the difficulty in identifying a bug in the expression, I tend to agree. That's why I have a lot of unit test cases for each meaningful regular expression.
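
A table-driven check along these lines might look like the sketch below. The pattern is copied from the MONEY example earlier in the thread; the `CASES` table and its entries are hypothetical, not the actual test suite.

```ruby
# Copied from the MONEY pattern earlier in the thread.
MONEY = /^(\$[ ]*)?([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]{0,2})?$/

# Hypothetical test table: input => should the pattern accept it?
CASES = {
  '$1,234.56' => true,
  '1234'      => true,
  '$ 5'       => true,
  '.5'        => true,
  ''          => true,   # every group is optional, so the empty string is accepted
  '12.345'    => false,  # more than two decimal places
  '1,23'      => false,  # a comma must be followed by three digits
  'abc'       => false
}

CASES.each do |input, expected|
  actual = !!(input =~ MONEY)
  raise "#{input.inspect}: expected #{expected}, got #{actual}" unless actual == expected
end
puts 'all cases pass'
```

One more case worth considering for such a table: in Ruby, `^` and `$` anchor lines rather than the whole string, so a multi-line input such as `"abc\n12"` is accepted; `\A` and `\z` anchor the whole string.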

Regarding restricting the inputs to a subset of valid inputs, which inputs would you like to accept that the regex rejects (for U.S. currency only)? I haven't had any complaints yet, but that's not to say it won't happen in the future.

-----

1 point by partdavid 6486 days ago | link

1) If you have capture patterns, you have code in one place dependent on the expression in another without the coupling being clear. More mildly, you are married to regular expression operators because you have direct references to your (regular-expression-defined) subprogram all over. You can't decide not to make it a regular expression; or to make it two, or make it a much clearer expression and a programmatic dress-up. The alternative is to not use regular expressions.

2) Eh, unit tests can catch the mistakes you anticipated making. Lots of other mistakes are possible. Why write a complex regular expression and a page full of unit testing code for it when you could write more straightforward logic?

3) Like I said, it's not clear why you have the expressions you do.

I'm not saying regular expressions never ever have their place. In particular, they can be a convenient method to offer users to specify search and validation patterns and that kind of thing. But fixing program logic into them is a bad idea.

Now, if it's inconvenient or inefficient to express that textual extraction in some way other than regular expressions, I'm suggesting that is a failure of the language (for example, because pattern matching is weak or specifying grammars is cumbersome), not a point for recommending regular expressions.

-----

2 points by lojic 6470 days ago | link

Sorry, I just now saw this. The Arc forum makes it darn near impossible to realize someone has replied to an older item :(

I think an example of what you're talking about would be great. If you have a better way to validate textual data than regular expressions, then naturally I would want to know about it.

Here are a few regular expressions I've collected. I realize they're not perfect (e.g. the email regex), but they're good enough for my purposes presently.

    REGEX_EMAIL     = /^([-\w+.]+)@(([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})|(([-\w]+\.)+[a-zA-Z]{2,4}))$/
    REGEX_FLOAT     = /^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]*)?$/
    REGEX_INTEGER   = /^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)$/
    REGEX_ISO_8601  = /^(\d{4})-(\d{2})-(\d{2})[Tt](\d{2})[-:](\d{2})[-:](\d{2})[Zz]$/
    REGEX_MONEY     = /^(\$[ ]*)?([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]{0,2})?$/
    REGEX_PHONE     = /^(\((\d{3})\)|(\d{3}))[-. ]?(\d{3})[-.]?(\d{4})[ ]*([^\s\d].{0,9})?$/
    REGEX_SSN       = /^(\d{3})-?(\d{2})-?(\d{4})$/
    REGEX_ZIP_CODE  = /^(\d{5})[- ]?(\d{4})?$/

So, what would you use to accomplish the same thing without regular expressions that is as concise? The regular expressions allow an easy way to both validate user input and parse it via groups. They're declarative vs. imperative. I have these in a Ruby module with associated functions for parsing (which primarily use the groups) etc., so they're encapsulated together.
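
As an illustration of validating and parsing via groups in one place, here is a minimal sketch built on REGEX_ISO_8601 from the list above. The module name `Patterns` and the function `parse_iso_8601` are assumptions for the example, not the actual module.

```ruby
module Patterns  # hypothetical module name
  REGEX_ISO_8601 = /^(\d{4})-(\d{2})-(\d{2})[Tt](\d{2})[-:](\d{2})[-:](\d{2})[Zz]$/

  # Validate and parse in one step: the six groups become the Time fields.
  def self.parse_iso_8601(str)
    m = REGEX_ISO_8601.match(str)
    raise ArgumentError, "invalid timestamp: #{str.inspect}" unless m
    Time.utc(*m.captures.map(&:to_i))
  end
end

t = Patterns.parse_iso_8601('2008-02-14T09:30:00Z')
puts t.year   # prints 2008
```

Callers only see `parse_iso_8601`, so the group numbering stays private to the module.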

I think you mentioned you're an Erlang programmer, so how would the non-regex Erlang code look to validate an email address corresponding to the REGEX_EMAIL expression above?

-----

1 point by partdavid 6460 days ago | link

Ah, you're right, it's a bit hard to see when folks have replied. Yes, I'm an Erlang programmer.

In response to your question, I don't accept your premise that replicating a particular regular expression is a real programming task. You say your email regular expression isn't perfect, but it's not clear to me why you chose that particular set of restrictions beyond what's defined in the RFC--so it's a little hard for me to replicate (for example, the local-part and domain of the address can have more kinds of characters than what you have defined).

Instead, I'll offer this as a non-equivalent but interesting comparison. I've elided the module declarations (as have you), including the imports that allow some of these functions to be called without their module qualifiers:

  email(S) ->
     [User, Domainp] = tokens(S, "@"),
     {User,
      case {address(Domainp), reverse(tokens(Domainp, "."))} of
         {{ok, Addr}, _} -> Addr;
         {_, RDomain = [Tld|_]} when length(Tld) >= 2,
                                     length(Tld) =< 4 ->
            join(reverse(RDomain), ".")
      end
     }.

I don't know how the terseness of this compares with your example, given that it includes some things that yours doesn't (a way to call it, a format for the return value rather than the capture variables). Terseness, of course, in the pg sense of code tree nodes, whatever they are. :)

The Erlang function above returns a tuple of the local-part and the domain part and throws an error if it can't match the address. If this were something I wanted to ship around to other functions or send to another machine or store in a database table or something, I would have email/1 return a fun (function object) instead.
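
For readers following the Ruby examples, the same shape can be approximated without regular expressions as below. This is a rough rendering, not the Erlang author's code: `split_email` and `dotted_quad` are hypothetical names, `dotted_quad` merely stands in for the `address` call, and edge cases such as empty tokens are not handled identically.

```ruby
# Rough Ruby rendering of the Erlang email/1 above; names are hypothetical.

# Stand-in for the address call: returns [a, b, c, d] or nil.
def dotted_quad(labels)
  return nil unless labels.length == 4
  nums = labels.map { |l| l.to_i }
  ok = labels.zip(nums).all? { |l, n| n.to_s == l && (0..255).cover?(n) }
  ok ? nums : nil
end

# Returns [user, domain-or-address]; raises ArgumentError when nothing fits.
def split_email(str)
  user, domain, *rest = str.split('@')
  raise ArgumentError, 'expected user@domain' if domain.nil? || !rest.empty?

  labels = domain.split('.')
  addr = dotted_quad(labels)
  return [user, addr] if addr

  tld = labels.last.to_s
  raise ArgumentError, 'bad top-level domain' unless (2..4).cover?(tld.length)
  [user, labels.join('.')]
end

p split_email('bob@example.com')  # ["bob", "example.com"]
p split_email('bob@10.0.0.1')     # ["bob", [10, 0, 0, 1]]
```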

If either one of us wanted something better than what we have (or even if we don't--it seems like coming up with The Right Thing To Do With Email Addresses is worth a bit of time to do only once) I would write a grammar. The applicable RFC 2822 more or less contains the correct one, which is only a few lines.

At the "low" end of text processing power, there are basic functional operations on strings and lists, and at the "high" end there are grammar specifiers and parser generators. In the band in between live regular expressions, and I am not convinced that that band is very wide. I like regular expressions (and, indeed, I would like it very much if Erlang had better support for them) but for me they are a specialized tool, particularly useful (like wildcard globbing) for offering as an input method to users.

But they aren't a general solution to every kind of problem, and for that reason I don't think Arc or any other general-purpose language benefits from baking them into the basic syntax--they belong in a library.

-----