Arc Forumnew | comments | leaders | submitlogin
Norvig's spelling corrector in Arc?
4 points by adm 4366 days ago | 5 comments
anybody tried doing Arc version?

1 point by lojic 4309 days ago | link

I found the following implementation in Clojure which compares much more favorably to Norvig's Python version than the other Scheme/Lisp versions I've seen. Still curious about how Arc might compare.


  (defn words [text] (re-seq #"[a-z]+" (. text (toLowerCase))))

  (defn train [features]
    (reduce (fn [model f] (assoc model f (inc (get model f 1)))) 
            {} features))

  (def *nwords* (train (words (slurp "big.txt"))))

  (defn edits1 [word]
    (let [alphabet "abcdefghijklmnopqrstuvwxyz", n (count word)]
      (distinct (concat
        (for [i (range n)] (str (subs word 0 i) (subs word (inc i))))
        (for [i (range (dec n))]
          (str (subs word 0 i) (nth word (inc i)) (nth word i) (subs word (+ 2 i))))
        (for [i (range n) c alphabet] (str (subs word 0 i) c (subs word (inc i))))
        (for [i (range (inc n)) c alphabet] (str (subs word 0 i) c (subs word i)))))))

  (defn known [words nwords] (for [w words :when (nwords w)]  w))

  (defn known-edits2 [word nwords] 
    (for [e1 (edits1 word) e2 (edits1 e1) :when (nwords e2)]  e2))

  (defn correct [word nwords]
    (let [candidates (or (known [word] nwords) (known (edits1 word) nwords) 
                         (known-edits2 word nwords) [word])]
      (apply max-key #(get nwords % 1) candidates)))


2 points by almkglor 4365 days ago | link

Probably one potential problem would be the splitting of a string into words. A minor problem is that of figuring out what a "word" is, i.e. the division between words.

Otherwise looks like a pretty standard Bayesian analysis, which I believe pg has done already.


1 point by fallintothis 4363 days ago | link

Not really. It's a simple regexp that Norvig uses.


  def words(text): return re.findall('[a-z]+', text.lower())

  (def words (text) (tokens (downcase text) [~<= #\a _ #\z]))


2 points by cluelessmoron 4242 days ago | link

      (bayes-train (map lower-case (parse (read-file {big.dms}) {\W} 0)) 'Lexicon)
      (define (edits1 word) 
       (let ((l (length word)) (res '()))
        (for (i 0 (- l 1)) (push (replace (nth i (string word)) (string word) "") res -1))
        (for (i 0 (- l 2)) (push (swap i (+ i 1) (string word)) res -1))
        (for (i 0 (- l 1)) (for (c (char "a") (char "z")) 
          (push (replace (nth i (string word)) (string word) (char c)) res -1)
          (push (replace (nth i (string word)) (string word) (string (char c) (nth i (string word)) )) res -1)))
      (define (edits2 word) (unique (flat (map edits1 (edits1 word)))))
      (define (known-edits2 word) (filter (fn (w) (Lexicon w)) (edits2 word)))
      (define (known words) (filter (fn (w) (Lexicon w)) words))
      (define (correct word)
        (let ((candidates (or (known (list word)) (known (edits1 word)) (known (edits2 word)))))
          (first (sort candidates (fn (w1 w2) (> (Lexicon w1) (Lexicon w2)))))))

      (join (map correct (parse "tihs sentnce is ful of inkorrect spelingz")) " ")
      this sentence is full of incorrect spelling
that's newLISP - should be easy to convert...


1 point by lojic 4362 days ago | link

Here's a translation of Norvig's code to Ruby for comparison: