Popfile as NNTP proxy filter for Newsbin

PostPosted: Fri Jun 10, 2005 10:56 am
by macraig
I wonder about using Popfile's NNTP extension as a front end for Newsbin?

I ask because Newsbin (or I) needs help filtering newsgroup spam. Spammers have become quite creative with Subject and From: information; I can't seem to make Newsbin filters that catch more than a small fraction of it. Then there are the posts that aren't spam as such but which one person or another doesn't want to download. Newsbin filters only have access to subject and body information, and the rest of the information hidden in post headers seems to be wasted. Popfile as an NNTP proxy can make good use of what Newsbin can't, though it can only use subject modification to feed its determinations to Newsbin. Popfile might also be better or more flexible at classifying and excluding non-spam-but-still-undesirable posts.
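To make the subject-modification channel concrete, here's a rough sketch of the idea: a proxy classifies each post and prepends a tag to the Subject line, which is the only signal a subject-modifying proxy can pass to the downstream reader. The `[spam]` tag format and the toy `classify()` rule are my own illustrative assumptions, not POPFile's or Newsbin's actual behavior.

```python
def classify(headers):
    """Toy stand-in for a real classifier: flag posts whose From:
    domain looks throwaway. A real filter would weigh many headers."""
    sender = headers.get("From", "")
    return "spam" if sender.endswith(".invalid") else "good"

def tag_subject(headers):
    """Return the Subject line with the classification prepended --
    the one channel a subject-modifying proxy has to the reader."""
    bucket = classify(headers)
    return f"[{bucket}] {headers.get('Subject', '')}"

post = {"From": "bulk@mailer.invalid", "Subject": "Cheap meds here"}
print(tag_subject(post))  # -> [spam] Cheap meds here
```

A Newsbin-side filter would then only need to match the literal tag at the start of the subject, rather than trying to pattern-match the spam itself.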

Has anyone tried it?

Did you set up a separate instance of Popfile for it, or share the same one between POP3 and NNTP? Did you have to set it for subject modification for it to be useful in Newsbin? Did you choose a binary classification system - spam and non-spam - or more than two categories? What Newsbin filters did you set up to process the classifications?

PostPosted: Fri Jun 10, 2005 1:34 pm
by kairk
I've never used Popfile, but it sounded interesting so I installed it and configured it and NB5. The big problem I see in trying to use it is that Popfile apparently only sees and works with the messages that you download. Without it working at the header level, it does not seem to be very useful for binary newsgroups.

PostPosted: Fri Jun 10, 2005 2:09 pm
by DThor
I was thinking the same thing...those sorts of learning spam filters need more than just a Subject, and I suspect the NNTP addon is designed to read the whole message, like it was email.

The other problem is that, arguably, it's harder to tell what "legit" news posts are than "legit" email posts. Good email tends to have certain characteristics - the email addy being in your addressbook, no extremely odd subjects, no exe attachments - whereas almost everything I want to download from Usenet can't make those claims. :)

I'd certainly love to have a way to filter well, especially during sporge season, but I'm not sure I could trust it!

DT

PostPosted: Fri Jun 10, 2005 4:04 pm
by macraig
Ah, these are things that didn't occur to me; I mentally skipped the important detail of header retrieval and how critical that is when dealing with Usenet and binaries.

So then POPFile-as-NNTP-proxy can't see or act upon header retrieval at all? Technically you can retrieve just headers with POP3 e-mail too, though in practice almost no one does any more, and I haven't since I've used POPFile. If POPFile's NNTP extension doesn't act upon header retrieval at all, then that does make it useless. Even if you retrieved *everything* unselectively for a while, let POPFile "see" it, and trained it on its mistakes, it still wouldn't help once you reverted to header retrievals: POPFile's modifications to the subject field wouldn't take place during that step, so you'd never see the classifications until *after* you'd retrieved the whole message (and binary).

Is that about right? Oh, pooh! :(
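For what it's worth, the limitation comes down to what header-only retrieval actually exposes. An NNTP overview (XOVER) response is just a handful of tab-separated fields per article, so a proxy that only rewrites subjects while relaying full articles never touches what a headers-only reader sees. A sketch of parsing one such line, assuming the conventional overview field order (servers may append extra fields):

```python
# Conventional XOVER overview field order; extras vary per server.
OVERVIEW_FIELDS = ["number", "subject", "from", "date",
                   "message-id", "references", "bytes", "lines"]

def parse_xover(line):
    """Split one XOVER response line into a field dict."""
    return dict(zip(OVERVIEW_FIELDS, line.split("\t")))

sample = ("12345\tRe: some post\tposter@example.com\t"
          "Fri, 10 Jun 2005 10:56:00 GMT\t<abc@example.com>\t\t2048\t40")
overview = parse_xover(sample)
print(overview["subject"])  # -> Re: some post
```

That short list is all the information a headers-only pass has to classify against, which is exactly the gap kairk pointed out.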

PostPosted: Fri Sep 16, 2005 10:39 am
by bobkoure
Popfile (which I use for email, BTW) is a Bayesian filter - which means it tracks statistical word occurrences in emails (both bodies and, I think, subjects). As you tell it how to classify each email, it finds the "unusual" word occurrences that distinguish that email from the ones not classified in that group.
IMHO, there aren't enough "words" in the subject/author for it to be able to build a "corpus" of statistically interesting word occurrences.
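To make the word-statistics point concrete, here's a toy version of the bookkeeping such a filter does - per-category word counts, then a smoothed log-probability score at classification time. The categories and training lines are made up for illustration; a real corpus needs far more text than a subject line provides, which is exactly the problem above.

```python
from collections import Counter
import math

class TinyBayes:
    """Bare-bones naive Bayes over whitespace-split words."""
    def __init__(self):
        self.words = {}        # category -> Counter of word occurrences
        self.docs = Counter()  # category -> number of training posts

    def train(self, category, text):
        self.words.setdefault(category, Counter()).update(text.lower().split())
        self.docs[category] += 1

    def classify(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.docs.values())
        best, best_score = None, float("-inf")
        for cat, counts in self.words.items():
            # log prior plus add-one-smoothed log likelihoods
            score = math.log(self.docs[cat] / total_docs)
            denom = sum(counts.values()) + len(counts)
            for w in tokens:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

bayes = TinyBayes()
bayes.train("spam", "hot singles cheap meds")
bayes.train("good", "linux kernel patch discussion")
print(bayes.classify("cheap kernel meds"))  # -> spam
```

With whole message bodies, the counters accumulate enough distinctive words to separate categories; with only a subject and author, most scores stay near the prior and the classifier has little to go on.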

Of course, if you had access to the body, then you'd probably have pretty good luck with a Bayesian filter. One thought that springs to mind is some kind of site that actually went through all the bodies and then made available "group summaries" - essentially a list of article IDs and statistical word occurrences. (Note that you don't actually have to list the words themselves so long as you have an agreed-upon dictionary - and you can do Huffman-like things to keep the IDs short for the most common words; actually the most common uncommon words, since there's no need to list the common ones at all.)
Then you'd need something on the client side, so that a user could take the statistically significant words in each article and run a Bayesian filter to sort articles into categories they were interested in (Bayesian filters aren't just for anti-spam; they can be used to sort into whatever categories you like).
The advantage here is that, although there is a remote site filtering newsgroups, the categories would be yours (one man's sporge is another man's honey - or something like that).
Of course, this would probably be something you had to pay for. I'm not sure what the actual costs of a site offering that kind of service might be: probably at least the inbound bandwidth of any of the major news providers, with much less outbound since you're only sending significant-word occurrences - probably as incremental digests for each group, plus per-group dictionaries every so often. Then there's the cost of storage, processing power, and a couple of humans to watch over the whole thing.
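As a toy illustration of the digest encoding sketched above: with an agreed-upon per-group dictionary, a summary can list each article's "interesting" words as dictionary indices instead of the words themselves. The stopword list and dictionary here are invented for illustration.

```python
STOPWORDS = {"the", "a", "re:", "of", "and"}        # too common to bother listing
DICTIONARY = ["kernel", "meds", "patch", "singles"]  # shared per-group dictionary
INDEX = {w: i for i, w in enumerate(DICTIONARY)}

def digest(article_id, body):
    """Encode one article as (id, sorted dictionary indices) --
    the words themselves never need to travel with the digest."""
    words = {w for w in body.lower().split() if w not in STOPWORDS}
    return article_id, sorted(INDEX[w] for w in words if w in INDEX)

print(digest("<abc@example.com>", "Re: the kernel patch"))
# -> ('<abc@example.com>', [0, 2])
```

The client would feed those index lists into its own Bayesian filter against its own categories, which is what keeps the classification personal even though the summarizing happens remotely.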
Given that Usenet is not known for folks who like to spend lots of money, I'm not sure anyone's going to find the thought of building something like this terribly attractive. Maybe, though...(?)