Arc Forumnew | comments | leaders | submitlogin
4 points by palsecam 5319 days ago | link | parent

References:

* the original version of Norvig's spelling corrector is written in Python, is 21 lines, and can be found at http://norvig.com/spell-correct.html, w/ lots of explanations and test materials.

* a thread related to this on this forum: http://arclanguage.org/item?id=7906 (Norvig's spelling corrector in Arc?)

---

> Is this a correct impl?

Will look at it in details this evening, but for now:

* not sure if this changes anything, but in Norvig's edits1() version:

   >>> edits1("python")
   # the var `s' is [('', 'python'), ('p', 'ython'), ('py', 'thon'), 
   ('pyt', 'hon'), ('pyth', 'on'), ('pytho', 'n'), ('python', '')]
But in your Arc version, the result of (map [split ... (- (len word) 1)) omits the last ("python" "")

   arc>	(def e (word)
   	  (map [split word _] (range 0 (- (len word) 1))))
   #<procedure: e>
   arc>	(e "python")
   (("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on") 
   ("pytho" "n"))
   arc>	(def e (word)  ; "Fixed", making it even shorter :-)
          (map [split word _] (range 0 (len word))))
   *** redefining e
   #<procedure: e>
   arc>	(e "python")
   (("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on") 
   ("pytho" "n") ("python" ""))
Makes me think, (- (len xs) 1) is a common idiom in languages w/ zero-based indexing.

Makes me think, maybe 'range should behave like in Python, i.e: if there is only one arg, the `start' is implicitely 0.

---

EDIT: oh my god, there are links to the impls of this spelling corrector in other languages on Norvig's page, and the Perl one is 63 lines (!) and is written in Perl6. I suppose I'd not be able to sleep well before trying to make something shorter (and in Perl5). A Perl program 3 times longer than its Python equivalent, never seen that before!

EDIT2: heck, thinking of it, the list-comprehension and special indexes of Python (word[:i]/word[i:]) may actually not be easy/short to do in Perl...



4 points by palsecam 5319 days ago | link

OK, quick & dirty implementation in Perl5, copying pg's impl:

   open F, 'big.txt';
   while (<F>) { $NWORDS{$_}++ for (lc =~ /[a-z]+/g); }

   sub dedup { %h = map { $_, 1 } @_; keys %h }

   sub edits1 {
     $word = shift;
     &dedup(map { ($a, $b) = @$_;
	          ( $a . substr($b, 1),
	            $a . substr($b, 1, 1) . substr($b, 0, 1) . substr($b, 2),
	            map { ($a . $_ . substr($b,1), $a . $_ . $b) } 'a'..'z' )
       } map { [substr($word, 0, $_), substr($word, $_)] } 0..length($word)-1);
   }

   sub edits2 { &dedup(map { &edits1($_) } &edits1(shift)) }

   sub correct {
     $win = shift;
     for (&edits1($win), &edits2($win)) {
	$win = $_ if ($NWORDS{$_} > $NWORDS{$win});
     }
     $win;
   }
That's 19 lines of code. So Arc "wins" (and is far more elegant anyway, this is spaghetti Perl). But I feel better to know a Perl version doesn't need to be 63 lines long (http://www.riffraff.info/2007/5/20/a-spell-corrector-in-perl...).

But. Both this version & pg's one are actually buggy. Obvious example:

  arc> (correct "yellow")
  "fellow"
Canonical version gives the expected answer, "yellow". So I'd say, no, these are not correct implementations (the Perl version seems to always give the same results than the Arc one).

A thing I have remarked is that the list returned by 'edits{1|2} contains the original value (i.e: (find "python" (edits1 "python")) is t), where this is not the behaviour of Norvig's version. (That's also why the loop in my correct() isn't (for ($win, &edits1($win), &edits2($win)), in contrary to the Arc one).

And this may be incorrect. Or maybe, if $win/word is in nwords, 'correct should stop (immediately). This would - at least - correct the "yellow" -> "fellow" problem.

This is, however, not the only issue. Norvig's version and these versions, given the same "big.txt", don't give the same correction for other words (try "speling", "godd"). And I strongly suppose that Norvig's version is the most correct.

-----

3 points by palsecam 5319 days ago | link

  (def known (words) (dedup:keep [nwords _] words))  ; lines count is now 12

  (def correct2 (word (o f [nwords _]))
    (most f 
      (or (known:list word) (known:edits1 word) (known:edits2 word) (list word))))
Or:

  (def correct3 (word (o f [nwords _]))  ; don't need 'known, but require aspirin
    (most f (or ((apply orf (map [fn (w) (dedup:keep [nwords _] (_ w))]
                              (list list edits1 edits2)))
                   word) (list word))))

  arc> (correct{2/3} "yellow")
  yellow
  arc> (correct{2/3} "godd")
  good
  arc> (correct{2/3} "speling")
  spelling
  arc> (correct{2/3} "unnkknnoown")
  unnkknnoown
Not exactly the same results than Norvig's version (>>> correct("speling") -> "sling" / "godd" -> "god") but I tested the Ruby version linked on his site, and it yields the same results. Note that the result for "speling" is not really good in canonical version. Maybe it's because the order on Python sets is different from the one in Ruby/Arc lists. I should port the test program of Norvig to stop worrying, but it's OK for now. For now, let's say this version is better than the Norvig's one (!!!)

Bonus: performance is better on average.

---

EDIT:

1. also need to be clarified:

* (range 0 (- (len word) 1)) VS (range 0 (len word))

* the need or not of (known:list word)

2. A revised version for the language with the happiest users (according to... Twitter :-D, see http://blog.doloreslabs.com/2009/05/the-programming-language... FTR) is left as an exercice to the reader. Hey it's an Arc forum here, not a Perl one ;-)!

3. In the same vein, could be interesting to solve the Euler Project (http://projecteuler.net/) in Arc. I think zck has done some work in this area (http://news.ycombinator.com/item?id=837030).

-----

1 point by s-phi-nl 5305 days ago | link

When I run Norvig's version, I do not get the results you do, but get the same results as you get from the Ruby/Arc versions.

  >>> spell.correct('speling')
  'spelling'
  >>> spell.correct('godd')
  'good'

-----