Arc Forum
3 points by almkglor 5892 days ago | link | parent

IMO the hard part is the "best guess". ^^ I've been looking for papers on summarization and haven't found much. Hmm. Maybe start from the title: use its terms as search terms and pull out the text surrounding those words in the article.
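
A rough sketch of the title-as-search-terms idea, in Python. Everything here is an assumption for illustration: the names `best_sentence_for_title` and `STOP_WORDS`, the tiny stop-word list, and the naive sentence splitter. It scores each sentence by how many distinct title terms it contains and returns the top one.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "for", "is", "with"}

def tokenize(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_sentence_for_title(title, body):
    """Return the body sentence sharing the most distinct terms with the title."""
    terms = set(tokenize(title)) - STOP_WORDS
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    if not sentences:
        return None
    # max() keeps the first sentence among ties, so earlier text wins.
    return max(sentences, key=lambda s: len(terms & set(tokenize(s))))

title = "Arc macros explained"
body = "Welcome to my blog. Today I explain how macros work in Arc. Thanks for reading."
print(best_sentence_for_title(title, body))
```

Exact word matching is crude ("explained" does not match "explain" above), so stemming would probably help.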


3 points by antiismist 5892 days ago | link

There are easy ways and hard ways to do it. The easiest: for most articles, take the first paragraph element in the DOM that contains more than a certain number of words. My guess is that this would be the lead paragraph most of the time.
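
Something like this, as a minimal sketch using only Python's standard `html.parser`. The class name, `lead_paragraph`, and the default of 25 words are my own placeholders, and the threshold would need tuning on real pages.

```python
from html.parser import HTMLParser

class FirstLongParagraph(HTMLParser):
    """Remember the first <p> whose text has at least min_words words."""

    def __init__(self, min_words=25):
        super().__init__()
        self.min_words = min_words
        self.result = None
        self._buf = None  # list of text chunks while inside a <p>, else None

    def _flush(self):
        """Finish the current paragraph and keep it if it is long enough."""
        if self._buf is None:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if self.result is None and len(text.split()) >= self.min_words:
            self.result = text
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an unclosed one
            if self.result is None:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

def lead_paragraph(html, min_words=25):
    """Return the first sufficiently long paragraph, or None."""
    parser = FirstLongParagraph(min_words)
    parser.feed(html)
    parser.close()
    parser._flush()  # handle a trailing <p> that was never closed
    return parser.result
```

Inline tags such as `<b>` or `<a>` inside the paragraph are fine, since only their text is collected.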

The second easy way is to do the above most of the time, but use site-specific rules instead for particular sites.
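
One way this could be wired up is a table of per-host rules with the generic heuristic as the fallback. This is only a sketch: the host `news.example.com`, the `standfirst` class, and both extractor functions are made up, and the regex-based generic extractor is a stand-in for a proper first-long-paragraph routine.

```python
import re
from urllib.parse import urlparse

def generic_extractor(html):
    """Stand-in for the generic heuristic: text of the first <p> element."""
    m = re.search(r"<p[^>]*>(.*?)</p>", html, re.S | re.I)
    return m.group(1).strip() if m else None

def example_news_extractor(html):
    """Hypothetical rule for a site that marks its summary with class="standfirst"."""
    m = re.search(r'<div class="standfirst">(.*?)</div>', html, re.S)
    return m.group(1).strip() if m else None

# Hostname -> site-specific extractor. Only listed sites get special handling.
SITE_RULES = {"news.example.com": example_news_extractor}

def extract_summary(url, html):
    """Try the site-specific rule first, then fall back to the generic one."""
    host = urlparse(url).hostname or ""
    rule = SITE_RULES.get(host)
    if rule is not None:
        summary = rule(html)
        if summary:
            return summary
    return generic_extractor(html)
```

Falling back when a site rule finds nothing means a redesign on that site degrades to the generic guess rather than to no summary at all.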

You could also use a classifier to identify the right paragraph. Train it on all the descriptions that have appeared on the site before, then pick the text that best matches them. Or pick the first bit of text whose match score exceeds a certain threshold.
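
As a very simplified stand-in for a real classifier, here is a sketch that scores each candidate paragraph by bag-of-words cosine similarity against the pooled past descriptions. All names and the 0.3 threshold are assumptions. `first_match` is the "first bit over the threshold" variant and `best_match` is the "text that most matches" variant.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_profile(past_descriptions):
    """Pool the word counts of every earlier description (the 'training set')."""
    profile = Counter()
    for description in past_descriptions:
        profile.update(bag_of_words(description))
    return profile

def first_match(paragraphs, profile, threshold=0.3):
    """Return the first paragraph scoring at or above the threshold, else None."""
    for p in paragraphs:
        if cosine(bag_of_words(p), profile) >= threshold:
            return p
    return None

def best_match(paragraphs, profile):
    """Return the highest-scoring paragraph, or None if there are none."""
    return max(paragraphs, key=lambda p: cosine(bag_of_words(p), profile),
               default=None)

profile = build_profile(["A new macro system for Arc",
                         "Arc gets a faster macro expander"])
candidates = ["Subscribe to our newsletter today",
              "This post describes a macro expander for Arc",
              "Arc macro notes and a macro system"]
print(first_match(candidates, profile))
print(best_match(candidates, profile))
```

Raw counts let common words like "a" inflate the score, so TF-IDF weighting or a trained model would likely do better.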

The hardest way is to automatically generate a summary. I work in the automated document analysis business, and this is indeed pretty hard to do.

-----