Arc Forumnew | comments | leaders | submitlogin
Is this a correct implementation of Norvig's spelling corrector? (archub.org)
13 points by pg 3972 days ago | 12 comments


4 points by palsecam 3972 days ago | link

References:

* the original version of Norvig's spelling corrector is written in Python, is 21 lines, and can be found at http://norvig.com/spell-correct.html, w/ lots of explanations and test materials.

* a thread related to this on this forum: http://arclanguage.org/item?id=7906 (Norvig's spelling corrector in Arc?)

---

> Is this a correct impl?

Will look at it in details this evening, but for now:

* not sure if this changes anything, but in Norvig's edits1() version:

   >>> edits1("python")
   # the var `s' is [('', 'python'), ('p', 'ython'), ('py', 'thon'), 
   ('pyt', 'hon'), ('pyth', 'on'), ('pytho', 'n'), ('python', '')]
But in your Arc version, the result of (map [split ... (- (len word) 1)) omits the last ("python" "")

   arc>	(def e (word)
   	  (map [split word _] (range 0 (- (len word) 1))))
   #<procedure: e>
   arc>	(e "python")
   (("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on") 
   ("pytho" "n"))
   arc>	(def e (word)  ; "Fixed", making it even shorter :-)
          (map [split word _] (range 0 (len word))))
   *** redefining e
   #<procedure: e>
   arc>	(e "python")
   (("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on") 
   ("pytho" "n") ("python" ""))
Makes me think, (- (len xs) 1) is a common idiom in languages w/ zero-based indexing.

Makes me think, maybe 'range should behave like in Python, i.e: if there is only one arg, the `start' is implicitely 0.

---

EDIT: oh my god, there are links to the impls of this spelling corrector in other languages on Norvig's page, and the Perl one is 63 lines (!) and is written in Perl6. I suppose I'd not be able to sleep well before trying to make something shorter (and in Perl5). A Perl program 3 times longer than its Python equivalent, never seen that before!

EDIT2: heck, thinking of it, the list-comprehension and special indexes of Python (word[:i]/word[i:]) may actually not be easy/short to do in Perl...

-----

4 points by palsecam 3972 days ago | link

OK, quick & dirty implementation in Perl5, copying pg's impl:

   open F, 'big.txt';
   while (<F>) { $NWORDS{$_}++ for (lc =~ /[a-z]+/g); }

   sub dedup { %h = map { $_, 1 } @_; keys %h }

   sub edits1 {
     $word = shift;
     &dedup(map { ($a, $b) = @$_;
	          ( $a . substr($b, 1),
	            $a . substr($b, 1, 1) . substr($b, 0, 1) . substr($b, 2),
	            map { ($a . $_ . substr($b,1), $a . $_ . $b) } 'a'..'z' )
       } map { [substr($word, 0, $_), substr($word, $_)] } 0..length($word)-1);
   }

   sub edits2 { &dedup(map { &edits1($_) } &edits1(shift)) }

   sub correct {
     $win = shift;
     for (&edits1($win), &edits2($win)) {
	$win = $_ if ($NWORDS{$_} > $NWORDS{$win});
     }
     $win;
   }
That's 19 lines of code. So Arc "wins" (and is far more elegant anyway, this is spaghetti Perl). But I feel better to know a Perl version doesn't need to be 63 lines long (http://www.riffraff.info/2007/5/20/a-spell-corrector-in-perl...).

But. Both this version & pg's one are actually buggy. Obvious example:

  arc> (correct "yellow")
  "fellow"
Canonical version gives the expected answer, "yellow". So I'd say, no, these are not correct implementations (the Perl version seems to always give the same results than the Arc one).

A thing I have remarked is that the list returned by 'edits{1|2} contains the original value (i.e: (find "python" (edits1 "python")) is t), where this is not the behaviour of Norvig's version. (That's also why the loop in my correct() isn't (for ($win, &edits1($win), &edits2($win)), in contrary to the Arc one).

And this may be incorrect. Or maybe, if $win/word is in nwords, 'correct should stop (immediately). This would - at least - correct the "yellow" -> "fellow" problem.

This is, however, not the only issue. Norvig's version and these versions, given the same "big.txt", don't give the same correction for other words (try "speling", "godd"). And I strongly suppose that Norvig's version is the most correct.

-----

3 points by palsecam 3972 days ago | link

  (def known (words) (dedup:keep [nwords _] words))  ; lines count is now 12

  (def correct2 (word (o f [nwords _]))
    (most f 
      (or (known:list word) (known:edits1 word) (known:edits2 word) (list word))))
Or:

  (def correct3 (word (o f [nwords _]))  ; don't need 'known, but require aspirin
    (most f (or ((apply orf (map [fn (w) (dedup:keep [nwords _] (_ w))]
                              (list list edits1 edits2)))
                   word) (list word))))

  arc> (correct{2/3} "yellow")
  yellow
  arc> (correct{2/3} "godd")
  good
  arc> (correct{2/3} "speling")
  spelling
  arc> (correct{2/3} "unnkknnoown")
  unnkknnoown
Not exactly the same results than Norvig's version (>>> correct("speling") -> "sling" / "godd" -> "god") but I tested the Ruby version linked on his site, and it yields the same results. Note that the result for "speling" is not really good in canonical version. Maybe it's because the order on Python sets is different from the one in Ruby/Arc lists. I should port the test program of Norvig to stop worrying, but it's OK for now. For now, let's say this version is better than the Norvig's one (!!!)

Bonus: performance is better on average.

---

EDIT:

1. also need to be clarified:

* (range 0 (- (len word) 1)) VS (range 0 (len word))

* the need or not of (known:list word)

2. A revised version for the language with the happiest users (according to... Twitter :-D, see http://blog.doloreslabs.com/2009/05/the-programming-language... FTR) is left as an exercice to the reader. Hey it's an Arc forum here, not a Perl one ;-)!

3. In the same vein, could be interesting to solve the Euler Project (http://projecteuler.net/) in Arc. I think zck has done some work in this area (http://news.ycombinator.com/item?id=837030).

-----

1 point by s-phi-nl 3958 days ago | link

When I run Norvig's version, I do not get the results you do, but get the same results as you get from the Ruby/Arc versions.

  >>> spell.correct('speling')
  'spelling'
  >>> spell.correct('godd')
  'good'

-----

3 points by lojic 3963 days ago | link

If it is correct, it's impressive. I wrote the Ruby version that Norvig links to which is here:

http://lojic.com/blog/2008/09/04/how-to-write-a-spelling-cor...

One third the lines of the Ruby version and roughly half the lines of Norvig's Python version is quite an endorsement of conciseness.

If anyone has input data that demonstrates a difference between pg's and Norvig's, please reply with a comment.

Also, Paul, if it is equivalent in functionality, I'd pass it on to Norvig and he can add a link to his page.

http://norvig.com/spell-correct.html

You may want a nicer page to link to - maybe a blog post or something with more background info.

I actually get fairly consistent traffic from Norvig's site to my blog post, so it could be a way to expose Arc to more programmers.

-----

2 points by palsecam 3963 days ago | link

> If anyone has input data that demonstrates a difference between pg's and Norvig's, please reply with a comment.

In case you didn't see it, see above (http://arclanguage.org/item?id=10580)

Not exactly the same results than Norvig's version (>>> correct("speling") -> "sling" / "godd" -> "god") but I tested the Ruby version linked on his site, and it yields the same results.

I.e: Your / my "correction" of pg's version give "spelling" and "good".

And pg's vanilla version for (correct "yellow") gives "fellow".

-----

2 points by lojic 3963 days ago | link

Sorry, I noticed your post after I posted in haste. Nice work - 12 lines is still darn impressive. Followup with a comment if you use Norvig's test program and notice any other issues.

-----

3 points by thaddeus 3958 days ago | link

http://norvig.com/spell-correct.html

Noting the python 2.5 code atop the page I notice, as an example:

   def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
is on two lines. Where as the arc code puts it on one:

   (def edits2 (word) (dedup:flat (map edits1 (edits1 word))))
I'm just wondering/curious... what determines a line of code ?

I have no doubt the arc code is shorter character wise, but what's to stop me from putting it all on 1 line and being the winner?

-----

2 points by waterhouse 3957 days ago | link

I would say, the code is put into lines naturally--the Arc edits2 definition is short enough that it fits comfortably on one line, while the Python known_edits2 definition probably does not--and then you count up the lines. (I say "probably" because perhaps some people would tolerate lines that long in their code. But I wouldn't, and it seems that neither did the person who wrote it.) What stops you from putting it all on one line and declaring victory? The fact that that's not how humans naturally read or write code; the first thing I'd do when reading such a program would be to separate it back into lines.

Anyway, if you want a more precisely standardizable measure of program length, try token count. That can't be fooled with, either by deleting whitespace or by giving things shorter names. (Though, if you wanted, you could take the program as a string, replace all whitespace with a special character so it's all one token, write a macro that converts the string back into real code, and thus convert an arbitrarily long program into one with a couple dozen tokens. Hmm. Is there any standardized measure of program length that can't be fooled with?)

-----

1 point by thaddeus 3957 days ago | link

> "I would say, the code is put into lines naturally"

I beg to differ. I've being reading pg's code and feel quite comfortable saying that his natural coding would have a line break in the example above - as would any of ours (unless you are unaware of the two spaces required for code on this forum - haha :)

Not a criticism of pg or his code. I think most the submissions to Norvigs site contain code where people are removing "natural" line breaks.

As an example..... Quote from comment on page: http://www.jelovic.com/weblog/?p=201

  "Perhaps we should stop at 15 lines before we start getting unreadable code :) !"
speaks the truth.

I don't think there's any real value in creating a table ranking languages by line numbers. I do think the code pg produced does demonstrate how pleasant arc is to read while being very short.

-----

3 points by pg 3954 days ago | link

I broke lines that would have been more than 80 characters.

-----

2 points by rkts 3957 days ago | link

This implementation is not quite correct because it doesn't do insertion at the end. If the last letter of a word is missing, the Arc will correct it by adding a letter at the next-to-last place and then swapping the last two letters; however, if the word has another error, the Arc won't correct it, whereas the Python will.

Whether this is a serious problem is open to question. However, it does mean that the comparison with Python is not apples to apples.

Incidentally, in case anyone cares, here's how it looks in rkts-lang,* a little toy language I've been working on:

  nwords = words "big.txt" . map downcase . Hap.counts

  alphabet = "abcdefghijklmnopqrstuvwxyz"

  edits1 w = Iter as c:
     n = size w
     for i of 0 to n-1: c (w[below i] ... w[above i])
     for i of 0 to n-2: c (w[i to i+1] @ w[i+1 downto i])
     for i of 0 to n-1: for l of alphabet: c (w[i] @ l)
     for i of 0 to n:   for l of alphabet: c (w[below i] ... l ... w[upfrom i])

  edits n w = iterate (join edits1) {w} . take n+1

  correct w = edits 2 w . map (keep nwords[]) . find ~empty . greatest nwords[]
If we permit the pg bug, then edits1 can be shortened a bit:

  edits1 w = Iter as c:
     for i of keys w:
        c (w[below i] ... w[above i])
        if i > 0: c (w[i-1 to i] @ w[i downto i-1])
        for l of alphabet: c (w[i] @ l); c (w[below i] ... l ... w[upfrom i])
(* Suggestions for a catchy name may be rewarded.)

-----