Arc Forumnew | comments | leaders | submitlogin
How to make this function simpler?
5 points by CatDancer 4563 days ago | 29 comments

  (def match-json-string (j)
    (if (is car.j #\")
         (let (j a)
              ((afn (j a)
                 (if no.j (err "missing close quote"))
                 (if (is car.j #\")
                      (list cdr.j a)
                      (iflet (j c) (match-json-backslash j)
                        (self j (cons c a))
                        (self (cdr j) (cons (car j) a)))))
               cdr.j nil)
           (list j (coerce rev.a 'string)))))
this takes a list of characters, and either returns nil if it is passed something that doesn't start with a quote:

  ; 30
  (match-json-string '(#\3 #\0))
  nil
or it returns a two element list, the remainder of the input and the parsed string:

  ; "AB\u0043",3
  (match-json-string '(#\" #\A #\B #\\ #\u #\0 #\0 #\4 #\3 #\" #\, #\3))
  ((#\, #\3) "ABC")
What the function does is pretty simple:

  if you see a "
    scan through the input
      error if no more input (missing close quote)
      see another ", you're done, return the accumulated result
      see a \, push the parsed backslash sequence onto the
          accumulated result and continue
      see any other char, push the char onto the accumulated
          result and continue
But I haven't been able to figure out how to make the code simpler.

I'm aware of the parser combinator library in Anarki, and I was able to get it to match things, but I wasn't able to figure out how to accumulate and return results with it.

If you'd like to run it, here's the rest of the code, which implements the match-json-backslash called by the function...

  (def hexdigit (c)
    (or (<= #\a c #\f) (<= #\A c #\F) digit.c))

  (def json-unicode-digits (j)
    (let u (firstn 4 j)
      (unless (and (is (len u) 4) (all hexdigit u)) (err "need 4 hexadecimal digits after \\u"))
      (coerce (int (coerce u 'string) 16) 'char)))

  (def match-char (j c next)
    (if (is car.j c)
         (next cdr.j)))

  (def match-json-unicode-escape (j)
    (match-char j #\u
      (fn (j)
        (list (nthcdr 4 j) (json-unicode-digits j)))))

  (def json-backslash-char (c)
    (case c
      #\" #\"
      #\\ #\\
      #\/ #\/
      #\b #\backspace
      #\f #\page
      #\n #\newline
      #\r #\return
      #\t #\tab
      (err "invalid JSON backslash char" c)))

  (def match-json-backslash (j)
    (match-char j #\\
      (fn (j)
        (if no.j (err "missing char after backslash"))
        (or match-json-unicode-escape.j
            (list cdr.j (json-backslash-char car.j))))))


3 points by CatDancer 4562 days ago | link

Here we go!

  (def match-json-string ()
    (match-char #\"
      (liststr:accum a
        (while (~match-char #\")
          (atend-err "missing close quote")
          (a (or (match-json-backslash)
                 (readc (stdin))))))))
I took shader's suggestion to use an input-port with peekc and readc. I figured that using a parameter to keep track of the parsing state was the way to go, and then realized that we already had a parameter (stdin) which did everything I needed it to. And taking Adlai's suggestion I made match-char a macro which avoids the extra fn's I had in the original.

Getting the parse position out of the function really broke the logjam. The old version was doing two things at once -- keeping track of the parse position and matching characters -- and that made it really hard to pull out bits of functionality into their own functions to make the main function shorter and clearer. Now it's much easier to tell what the function is doing! :)

Here's the rest of the code:

  (def hexdigit (c)
    (and (isa c 'char)
         (or (<= #\a c #\f) (<= #\A c #\F) (<= #\0 c #\9))))

  (def json-unicode-digits ()
    (let u (n-of 4 (readc (stdin)))
      (unless (all hexdigit u) (err "need 4 hexadecimal digits after \\u"))
      (coerce (int (coerce u 'string) 16) 'char)))

  ; json-backslash-char is the same

  (mac match-char (c . body)
    `(when (is (peekc (stdin)) ,c)
       (readc (stdin))
       ,@body))

  (def atend-err (msg)
    (unless (peekc (stdin))
      (err msg)))

  (def match-json-unicode-escape ()
    (match-char #\u (json-unicode-digits)))

  (def match-json-backslash ()
    (match-char #\\
      (atend-err "missing char after backslash")
      (or (match-json-unicode-escape)
          (json-backslash-char (readc (stdin))))))

  (def liststr (l)
    (coerce l 'string))
As you can see there's some more that could be done: a next-char function for (readc (stdin)) perhaps, and match-json-unicode-escape is so simple now that it could be inlined. That will be easy to work on :-)

Thank you everyone!

-----

1 point by shader 4561 days ago | link

Beautiful!

Only one comment: peekc and readc default to stdin, so you can leave out all of the (stdin) calls, unless you're leaving them for readability.

Either way, the new version is much more readable, shorter, and probably faster too. Congratulations!

-----

1 point by CatDancer 4561 days ago | link

You are using some other incarnation of Arc, perhaps?

  arc> (readc)
  Error: "procedure readc: expects 1 argument, given 0"
having readc and peekc default to stdin sounds like a real good idea though! :)

-----

1 point by shader 4561 days ago | link

Wow, a bug!

According to arcfn.com and the code that it references, it looks like it's supposed to default to stdin, but it's written in such a way to require at least one argument. If you pass nil, it reads from stdin because of the default.

The current code is:

  (xdef readc (lambda (str)
               (let ((p (if (ar-false? str)
                            (current-input-port)
                            str)))
                 (let ((c (read-char p)))
                   (if (eof-object? c) 'nil c)))))
Which requires at least on argument. It should probably be changed so that that argument is actually optional, but I don't know how. I should read up on my mzscheme ;)

-----

1 point by shader 4561 days ago | link

I think that optional arguments in scheme are just (var default) instead of just var.

So the new code would be:

    (xdef readc (lambda ((str (current-input-port))
               (let ((c (read-char p)))
                   (if (eof-object? c) 'nil c))))
I'm not certain - I haven't tested it, but it looks right. Unless scheme and arc have different ideas of false, which could cause more problems. Maybe it should just be:

    (xdef readc (lambda ((str (current-input-port))
               (let ((p (if (ar-false? str)
                            (current-input-port)
                            str)))
                 (let ((c (read-char p)))
                   (if (eof-object? c) 'nil c)))))
Which keeps the old false test just in case.

-----

1 point by CatDancer 4561 days ago | link

I think you should test it :-)

If it turns out not to work, see writec for an example of implementing an optional argument.

-----

1 point by shader 4561 days ago | link

hmmm. It seems that mzscheme (at least the version that I'm running arc on, 360) doesn't support optional args. I guess I'll have to do it some other way.

-----

3 points by shader 4561 days ago | link

Ok, I basically copied writec like you said, and the new version that actually works is:

  (xdef 'readc (lambda str
                 (let ((c (read-char
                           (if (pair? str)
                               (car str)
                               (current-input-port)))))
                   (if (eof-object? c) 'nil c))))
Now we just need to make the same transformation to readb and peekc, and then we're done.

btw, why is there no peekb? Or any other functions that work on bytes? (outside of binary.arc in Anarki) Did I overlook them?

-----

2 points by CatDancer 4561 days ago | link

It's fairly tedious to be doing this in Scheme, isn't it? We might let Scheme handle implementing the low level readc, and then in Arc redefine readc to be a more advanced function that can take an optional argument:

  (let original readc
    (= readc (fn ((o str (stdin)))
               (original str))))

  arc> (fromstring "abc" (readc))
  #\a
That pattern could be made into a macro:

  (mac redef (name args . body)
    `(let original ,name
       (= ,name (fn ,args ,@body))))
which makes writing the enhanced version of readc look like:

  (redef readc ((o str (stdin)))
    (original str))

-----

1 point by absz 4561 days ago | link

redef is already in Anarki; arc.arc, line 2446:

  (mac redef (name parms . body)
    " Redefine a function.  The old function definition may be used within
      `body' as the name `old'. "
    `(do (tostring
          (let old (varif ,name nilfn)
            (= ,name (fn ,parms ,@body))))
         ,name))
It's the same as yours, except (a) it suppresses the warning on re-assigning an identifier, and (b) it calls the original function old.

-----

2 points by CatDancer 4561 days ago | link

suppressing the warning doesn't appear to be necessary with =

  arc> (def foo () 3)
  #<procedure: foo>
  arc> (= foo 4)
  4
  arc>

-----

1 point by absz 4560 days ago | link

I was wondering if that was necessary... I think redef used to use set, which is why it was there.

-----

1 point by CatDancer 4561 days ago | link

btw, why is there no peekb? Or any other functions that work on bytes?

As part of the design process of finding the shortest Arc implementation, pg is careful not to include functions that he isn't actually using for news. This leads to some surprises (car, cdr, cadr, cddr, but no cdar?) but also removes cruft that builds up implementing things that people might need some day but turn out not to.

It turns out not to be a problem, since Arc is so concise it's really easy to extend, and so people quickly implement the things they need that pg happens not to be using for news.

I find I prefer the Arc approach, since libraries that try to provide everything I might need often have so much stuff that ironically they make it harder to implement what I actually need.

-----

2 points by Adlai 4561 days ago | link

Very useful... I'm surprised that it's not the "standard" yet. Nice job!

-----

1 point by pg 4560 days ago | link

Ok, readc, readb, and peekc now work this way.

-----

1 point by Adlai 4560 days ago | link

Probably a stupid question, but:

Does that mean that I should just manually patch ac.scm to comply with the new functions?

-----

1 point by CatDancer 4560 days ago | link

If you need the change right away you can patch ac.scm yourself, or, if you don't mind waiting, pg will eventually have a new arc3.tar containing the update.

-----

1 point by CatDancer 4561 days ago | link

I don't think it's a bug, readc has always required an argument. It would be an easy enhancement to make the argument optional though.

-----

1 point by Adlai 4561 days ago | link

It looks good! Thank you also for using my idea.

Side note: compare the "profiles" of this version, and the earlier versions -- this one has a much more "functional" profile.

I'm a bit confused what you mean about atend:err. Do you basically mean that there would be function composition between a macro, and a call to, for example, (err "missing char after backslash")? Something like

  (mac atend (alert)
    `(unless (peekc (stdin))
       ,alert))
I think I'm missing something...

EDIT: I get it now. I hadn't noticed that you only use /atend[-:]err/ at points where the next character might "correct" the error.

-----

1 point by CatDancer 4561 days ago | link

Do you basically mean that there would be function composition between a macro, and a call

Yes, a composition, though not a function composition. Because the Arc compiler rewrites (a:b ...) as (a (b ...)) when a:b appears in the first position in an expression, it works for macros also.

Thus

  (atend:err "missing close quote")
expands into

  (atend (err "missing close quote"))
which macro expands into

  (unless (peekc (stdin))
    (err "missing close quote"))

-----

1 point by CatDancer 4561 days ago | link

Say, I just realized that I could have an "atend" that would be a macro like match-char, and then the error check could look like:

  (atend:err "missing char after backslash")
funny how I only noticed that because of how I had spelled "atend-err" :-)

-----

1 point by Adlai 4562 days ago | link

Alrought, here's what I've cooked up. I'm not sure whether this is any better than CatDancer's original -- I was doing it more as a highly applied introduction to Arc :)

I copied out the code, commented it (spending about 80% of the time between www.arcfn.com and arc.arc), and then made two main changes. One is that I rewrote 'json-unicode-digits to only pass over the 4 chars once, rather than three times -- I did this by dissecting 'firstn, and putting the error checking forms inside there.

The other major change is the macro. I noticed that three functions used a similar pattern (the last one was a slight stretch, but after all, a 'let is a lambda deep inside).

My code:

  (def hexdigit (c)
    (or (<= #\a c #\f) (<= #\A c #\F) digit.c))

  (def json-backslash-char (c)
    (case c
      #\" #\"
      #\\ #\\
      #\/ #\/
      #\b #\backspace
      #\f #\page
      #\n #\newline
      #\r #\return
      #\t #\tab
      (err "invalid JSON backslash char" c) ) )

  ;; (The above two are verbatim from CatDancer's code
  ;;  Now comes some modified stuff)

  (def json-unicode-digits (j)
    (let u ((afn (n xs)
              (if (is n 0) nil
                    (or (no xs) (no:hexdigit (car xs)))
                  (err "need 4 hexadecimal digits after \u")
                    (cons (car xs) (self (-- n) (cdr xs)))))
            4 j)
      (coerce (int (coerce u 'string) 16) 'char) ) )

  (mac def-jsniffer (name c next (o arg 'cdr.j))
    `(def ,name (j)
       (if (is car.j ,c)
           (,next ,arg))))

  ;; ... because it sniffs ahead at the next char.
  
  (def-jsniffer match-json-unicode-escape #\u
    [list (nthcdr 4 _)
          (json-unicode-digits _)])

  (def-jsniffer match-json-backslash #\\
    [do (if no._ (err "missing char after backslash"))
        (or match-json-unicode-escape._
            (list cdr._ (json-backslash-char car._)) ) ] )

  (def-jsniffer match-json-string #\"
    (fn ((j a))
      (list j (coerce rev.a 'string)))
    ((afn (j a)
       (if no.j (err "missing close quote"))
       (if (is car.j #\") (list cdr.j a)
           (iflet (j c)
                 (match-json-backslash j)
               (self j (cons c a))
             (self (cdr j) (cons (car j) a)) ) ) )
     cdr.j nil))
Well, I hope that my experiment might have some useful ideas for you!

Adlai

-----

1 point by CatDancer 4562 days ago | link

I think this is really the key part:

  (mac def-jsniffer (name c next (o arg 'cdr.j))
    `(def ,name (j)
So we have three answers to the question, what to do if you're passing some state through a lot of functions, to the extent that passing it around is taking up as much code as what you're actually doing?

1. Put all the functions inside a surrounding lexical scope, and keep the state in some lexical variables that all the functions can access.

2. Write some macros so that you're still passing the state through all your functions, but the variables don't appear in your code.

3. Use some kind of dynamic binding (like Scheme parameters) so the functions can access the state without it having to be passed in as part of the function's arguments.

-----

1 point by CatDancer 4562 days ago | link

I think the question is what to do with the common operations such as moving the forward the pointer to what we're going to parse next. One way to do it is to keep state in a some surrounding lexical scope (similar to what lib/parser.arc in Anarki does):

  (def run-parser (j ...)
     (withs
        (next-char (fn () car.j))
        (bump-j (fn () (= j cdr.j))
        (parse-string (fn ()
                         if (is (next-char) #\")
                              (do (bump-j)
                                  ...
I might try keeping track of state implicitly using Scheme parameters and see if that makes things shorter.

-----

1 point by shader 4562 days ago | link

I think that char-char is supposed to be next char ;)

Other than that, I'm not good enough yet at writing parsers in arc to help you much. Your original was already pretty short, though there were a few spots that you could have used the . operator more (car, cdr, etc.)

I am also working on a parser too, though, so whatever you discover will certainly be helpful ;)

-----

1 point by CatDancer 4562 days ago | link

char-char

ha, I just squeaked in under the one-hour edit window :-)

-----

1 point by shader 4562 days ago | link

You know, if you used a input-port instead of a string, then next-char and bump-j would just be peekc and readc. I don't know if that would make you're life much easier, but it's an idea.

-----

1 point by CatDancer 4562 days ago | link

Right, a port encapsulates some state (where you're reading from, what your position is in the input), and then Arc uses a Scheme parameter (stdin) so that the input port doesn't need to be passed to every function that calls some function that calls some function that calls read or readc. So like you say I could use an input-port if that did what I needed, or I could implement my own thing if that turned out to be necessary.

-----

1 point by CatDancer 4562 days ago | link

Oops, typo in the title of my post, I mean "shorter": How could this function be shorter?

Hmm, maybe using macros to make the scanning variable j implicit? I'll try that when I have some more time...

-----