* the original version of Norvig's spelling corrector is written in Python, is 21 lines, and can be found at http://norvig.com/spell-correct.html, w/ lots of explanations and test materials.
Will look at it in detail this evening, but for now:
* not sure if this changes anything, but in Norvig's edits1() version:
>>> edits1("python")
# inside edits1, the var `s' is:
# [('', 'python'), ('p', 'ython'), ('py', 'thon'), ('pyt', 'hon'),
#  ('pyth', 'on'), ('pytho', 'n'), ('python', '')]
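If memory serves (so treat this as a paraphrase, not a quote), the line on Norvig's page that builds `s' uses range(len(word) + 1), and the + 1 is exactly what keeps that final ('python', '') pair. Pulled out as a standalone helper (the name splits is mine):

  def splits(word):
      # One split point per position, *including* one after the last
      # character, so len(word) + 1 pairs in total.
      return [(word[:i], word[i:]) for i in range(len(word) + 1)]

That last ('python', '') pair is what lets the insertion candidates place a letter after the end of the word.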
But in your Arc version, the result of (map [split word _] (range 0 (- (len word) 1))) omits the last pair, ("python" ""):
arc> (def e (word)
(map [split word _] (range 0 (- (len word) 1))))
#<procedure: e>
arc> (e "python")
(("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on")
("pytho" "n"))
arc> (def e (word) ; "Fixed", making it even shorter :-)
(map [split word _] (range 0 (len word))))
*** redefining e
#<procedure: e>
arc> (e "python")
(("" "python") ("p" "ython") ("py" "thon") ("pyt" "hon") ("pyth" "on")
("pytho" "n") ("python" ""))
Makes me think that (- (len xs) 1) is a common idiom in languages with zero-based indexing.
Makes me also think that maybe 'range should behave like Python's, i.e. if there is only one arg, the `start' is implicitly 0.
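For comparison (Python 2, as on Norvig's page):

  >>> range(6)        # one arg: `start' is implicitly 0
  [0, 1, 2, 3, 4, 5]
  >>> range(0, 6)     # explicit two-arg form, same result
  [0, 1, 2, 3, 4, 5]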
---
EDIT: oh my god, there are links to implementations of this spelling corrector in other languages on Norvig's page, and the Perl one is 63 lines (!) and is written in Perl 6. I suppose I won't be able to sleep well before trying to make something shorter (and in Perl 5). A Perl program 3 times longer than its Python equivalent; I've never seen that before!
EDIT2: heck, come to think of it, Python's list comprehensions and slice syntax (word[:i]/word[i:]) may actually not be easy or short to reproduce in Perl...
But. Both this version & pg's one are actually buggy. Obvious example:
arc> (correct "yellow")
"fellow"
The canonical version gives the expected answer, "yellow". So I'd say, no, these are not correct implementations (the Perl version seems to always give the same results as the Arc one).
One thing I've noticed is that the list returned by 'edits{1|2} contains the original word (i.e. (find "python" (edits1 "python")) is t), whereas Norvig's version doesn't behave that way. (That's also why the loop in my correct() isn't (for ($win, &edits1($win), &edits2($win))), unlike the Arc one.)
And this may be incorrect. Or maybe, if $win/word is already in nwords, 'correct should stop immediately; that would at least fix the "yellow" -> "fellow" problem.
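(That's in fact what Norvig's correct() does: the or short-circuits as soon as a candidate set is non-empty, so a word that is already in NWORDS is returned untouched before any edits are considered. Roughly, from memory of his page:)

  def known(words): return set(w for w in words if w in NWORDS)

  def correct(word):
      # Each known(...) call returns a (possibly empty) set; `or' picks
      # the first non-empty one, so a known word wins immediately.
      candidates = known([word]) or known(edits1(word)) or \
                   known_edits2(word) or [word]
      return max(candidates, key=NWORDS.get)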
This is, however, not the only issue. Norvig's version and these versions, given the same "big.txt", don't give the same correction for some other words (try "speling", "godd"). And I strongly suspect that Norvig's version is the most correct.
(def known (words) (dedup:keep [nwords _] words)) ; line count is now 12
(def correct2 (word (o f [nwords _]))
(most f
(or (known:list word) (known:edits1 word) (known:edits2 word) (list word))))
Or:
(def correct3 (word (o f [nwords _])) ; don't need 'known, but requires aspirin
  (most f (or ((apply orf (map [fn (w) (dedup:keep [nwords _] (_ w))]
                               (list list edits1 edits2)))
               word)
              (list word))))
arc> (correct{2/3} "yellow")
yellow
arc> (correct{2/3} "godd")
good
arc> (correct{2/3} "speling")
spelling
arc> (correct{2/3} "unnkknnoown")
unnkknnoown
Not exactly the same results as Norvig's version (>>> correct("speling") -> "sling" / "godd" -> "god"), but I tested the Ruby version linked on his site, and it yields the same results as mine. Note that the canonical result for "speling" is not really good anyway. Maybe it's because iteration order on Python sets differs from the order of Ruby/Arc lists. I should port Norvig's test program to stop worrying, but it's OK for now. For now, let's say this version is better than Norvig's (!!!)
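(A minimal sketch of what porting that test loop would amount to; the helper is hypothetical, and Norvig's real spelltest reports more statistics. His test sets map each correct word to a space-separated string of misspellings:)

  def spelltest(tests):
      # tests: {correct_word: "misspelling1 misspelling2 ..."}
      n = bad = 0
      for target, wrongs in tests.items():
          for wrong in wrongs.split():
              n += 1
              if correct(wrong) != target:
                  bad += 1
      return {'n': n, 'bad': bad, 'pct': 100.0 * (n - bad) / n}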
To sum up: your/my "corrected" version of pg's code gives "spelling" and "good".
And pg's vanilla version gives "fellow" for (correct "yellow").
Sorry, I noticed your post after I posted in haste. Nice work - 12 lines is still darn impressive. Follow up with a comment if you use Norvig's test program and notice any other issues.
I would say, the code is put into lines naturally--the Arc edits2 definition is short enough that it fits comfortably on one line, while the Python known_edits2 definition probably does not--and then you count up the lines. (I say "probably" because perhaps some people would tolerate lines that long in their code. But I wouldn't, and it seems that neither did the person who wrote it.) What stops you from putting it all on one line and declaring victory? The fact that that's not how humans naturally read or write code; the first thing I'd do when reading such a program would be to separate it back into lines.
Anyway, if you want a more precisely standardizable measure of program length, try token count. That can't be fooled with, either by deleting whitespace or by giving things shorter names. (Though, if you wanted, you could take the program as a string, replace all whitespace with a special character so it's all one token, write a macro that converts the string back into real code, and thus convert an arbitrarily long program into one with a couple dozen tokens. Hmm. Is there any standardized measure of program length that can't be fooled with?)
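(As a sketch of how easy that measure is to compute - here with Python's tokenize module on Python source; every language would need its own tokenizer, and the helper name is mine:)

  import io, tokenize

  def token_count(source):
      # Count real tokens; skip pure-layout tokens so that whitespace,
      # indentation and blank lines don't affect the score.
      layout = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                tokenize.DEDENT, tokenize.ENDMARKER)
      toks = tokenize.generate_tokens(io.StringIO(source).readline)
      return sum(1 for tok in toks if tok[0] not in layout)

By this measure, stripping whitespace or shortening names changes nothing; only the whitespace-into-a-token macro trick above would still fool it.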
> "I would say, the code is put into lines naturally"
I beg to differ. I've been reading pg's code and feel quite comfortable saying that his natural coding would have a line break in the example above - as would any of ours (unless you are unaware of the two spaces required for code on this forum - haha :)
Not a criticism of pg or his code. I think most of the submissions on Norvig's site contain code where people have removed "natural" line breaks.
"Perhaps we should stop at 15 lines before we start getting unreadable code :) !"
speaks the truth.
I don't think there's any real value in creating a table ranking languages by line count. I do think the code pg produced demonstrates how pleasant Arc is to read while being very short.
This implementation is not quite correct, because it doesn't do insertion at the end of the word. If the last letter of a word is missing, the Arc version will correct it by adding a letter at the next-to-last position and then swapping the last two letters; but if the word has another error as well, the Arc version won't correct it, whereas the Python version will.
Whether this is a serious problem is open to question. However, it does mean that the comparison with Python is not apples to apples.
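(A quick way to see the gap, sketched in Python - the inserts helper is hypothetical: with only len(word) split points, no insertion can ever land after the last character.)

  alphabet = 'abcdefghijklmnopqrstuvwxyz'

  def inserts(word, n_splits):
      # One-letter insertions built from the first n_splits split points.
      s = [(word[:i], word[i:]) for i in range(n_splits)]
      return set(a + c + b for a, b in s for c in alphabet)

  # The pg version effectively uses len(word) split points, so 'yello'
  # can't become 'yellow' in a single edit; Norvig's len(word) + 1 can:
  print('yellow' in inserts('yello', len('yello')))      # False
  print('yellow' in inserts('yello', len('yello') + 1))  # True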
Incidentally, in case anyone cares, here's how it looks in rkts-lang,* a little toy language I've been working on:
nwords = words "big.txt" . map downcase . Hap.counts
alphabet = "abcdefghijklmnopqrstuvwxyz"
edits1 w = Iter as c:
  n = size w
  for i of 0 to n-1: c (w[below i] ... w[above i])
  for i of 0 to n-2: c (w[i to i+1] @ w[i+1 downto i])
  for i of 0 to n-1: for l of alphabet: c (w[i] @ l)
  for i of 0 to n: for l of alphabet: c (w[below i] ... l ... w[upfrom i])
edits n w = iterate (join edits1) {w} . take n+1
correct w = edits 2 w . map (keep nwords[]) . find ~empty . greatest nwords[]
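(For readers who don't speak rkts-lang, the same pipeline sketched in Python - nwords and edits1 are assumed to exist as above:)

  def edits(n, w):
      # Successive generations: {w}, its edits1 results, their edits1, ...
      gen = {w}
      for _ in range(n + 1):
          yield gen
          gen = set(e for word in gen for e in edits1(word))

  def correct(w):
      # First generation containing known words wins; pick the most frequent.
      for gen in edits(2, w):
          known = [x for x in gen if x in nwords]
          if known:
              return max(known, key=nwords.get)
      return w  # fallback; the rkts version has no equivalent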
If we permit the pg bug, then edits1 can be shortened a bit:
edits1 w = Iter as c:
  for i of keys w:
    c (w[below i] ... w[above i])
    if i > 0: c (w[i-1 to i] @ w[i downto i-1])
    for l of alphabet: c (w[i] @ l); c (w[below i] ... l ... w[upfrom i])
(* Suggestions for a catchy name may be rewarded.)