Arc Forum
2 points by almkglor 5713 days ago | link | parent

One potential problem is splitting a string into words. It's a minor one: you have to decide what counts as a "word", i.e. where the boundaries between words fall.

Otherwise it looks like a pretty standard Bayesian analysis, which I believe pg has already done.



1 point by fallintothis 5711 days ago | link

Not really. It's a simple regexp that Norvig uses.

Python:

  import re

  def words(text): return re.findall('[a-z]+', text.lower())

Arc:

  (def words (text) (tokens (downcase text) [~<= #\a _ #\z]))
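
For what it's worth, here is a rough sketch of where that definition of a "word" gets fuzzy, assuming the regexp exactly as quoted above. The sample string is made up for illustration: apostrophes split a contraction in two, and digits and punctuation are simply dropped.

```python
import re

def words(text):
    # Lowercase first, then keep every maximal run of the letters a-z.
    return re.findall(r'[a-z]+', text.lower())

# Hypothetical sample input, not taken from Norvig's corpus.
print(words("Don't panic: it's 42!"))
# prints ['don', 't', 'panic', 'it', 's']
```

Whether "don" and "t" should count as separate words is the kind of boundary question raised in the parent comment; for a spelling corrector trained on a large corpus it probably matters little.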

-----