Arc Forumnew | comments | leaders | submitlogin
Norvig's spelling corrector in Arc?
4 points by adm 4156 days ago | 5 comments
anybody tried doing Arc version?

1 point by lojic 4099 days ago | link

I found the following implementation in Clojure which compares much more favorably to Norvig's Python version than the other Scheme/Lisp versions I've seen. Still curious about how Arc might compare.


  (defn words [text] (re-seq #"[a-z]+" (. text (toLowerCase))))

  (defn train [features]
    (reduce (fn [model f] (assoc model f (inc (get model f 1)))) 
            {} features))

  (def *nwords* (train (words (slurp "big.txt"))))

  (defn edits1 [word]
    (let [alphabet "abcdefghijklmnopqrstuvwxyz", n (count word)]
      (distinct (concat
        (for [i (range n)] (str (subs word 0 i) (subs word (inc i))))
        (for [i (range (dec n))]
          (str (subs word 0 i) (nth word (inc i)) (nth word i) (subs word (+ 2 i))))
        (for [i (range n) c alphabet] (str (subs word 0 i) c (subs word (inc i))))
        (for [i (range (inc n)) c alphabet] (str (subs word 0 i) c (subs word i)))))))

  (defn known [words nwords] (for [w words :when (nwords w)]  w))

  (defn known-edits2 [word nwords] 
    (for [e1 (edits1 word) e2 (edits1 e1) :when (nwords e2)]  e2))

  (defn correct [word nwords]
    (let [candidates (or (known [word] nwords) (known (edits1 word) nwords) 
                         (known-edits2 word nwords) [word])]
      (apply max-key #(get nwords % 1) candidates)))


2 points by almkglor 4155 days ago | link

Probably one potential problem would be the splitting of a string into words. A minor problem is that of figuring out what a "word" is, i.e. the division between words.

Otherwise looks like a pretty standard Bayesian analysis, which I believe pg has done already.


1 point by fallintothis 4153 days ago | link

Not really. It's a simple regexp that Norvig uses.


  def words(text): return re.findall('[a-z]+', text.lower())

  (def words (text) (tokens (downcase text) [~<= #\a _ #\z]))


2 points by cluelessmoron 4032 days ago | link

      (bayes-train (map lower-case (parse (read-file {big.dms}) {\W} 0)) 'Lexicon)
      (define (edits1 word) 
       (let ((l (length word)) (res '()))
        (for (i 0 (- l 1)) (push (replace (nth i (string word)) (string word) "") res -1))
        (for (i 0 (- l 2)) (push (swap i (+ i 1) (string word)) res -1))
        (for (i 0 (- l 1)) (for (c (char "a") (char "z")) 
          (push (replace (nth i (string word)) (string word) (char c)) res -1)
          (push (replace (nth i (string word)) (string word) (string (char c) (nth i (string word)) )) res -1)))
      (define (edits2 word) (unique (flat (map edits1 (edits1 word)))))
      (define (known-edits2 word) (filter (fn (w) (Lexicon w)) (edits2 word)))
      (define (known words) (filter (fn (w) (Lexicon w)) words))
      (define (correct word)
        (let ((candidates (or (known (list word)) (known (edits1 word)) (known (edits2 word)))))
          (first (sort candidates (fn (w1 w2) (> (Lexicon w1) (Lexicon w2)))))))

      (join (map correct (parse "tihs sentnce is ful of inkorrect spelingz")) " ")
      this sentence is full of incorrect spelling
that's newLISP - should be easy to convert...


1 point by lojic 4152 days ago | link

Here's a translation of Norvig's code to Ruby for comparison: