Next step in Newsbin filtering evolution?
Posted:
Sat Jun 11, 2005 4:33 am
by macraig
Okay, so I give up... is there currently no effective way to discriminate newsgroup spam? My POPFile idea went pop, too. So how do we humans recognize the spammers, then, and how do we teach that to Newsbin and other software?
In the short term, what about a poster whitelist, as opposed to a poster lockout? This could be implemented manually or automagically through some intelligent pattern-matching. Obviously Newsbin can't analyze binaries themselves to determine intent, but the habits of posters themselves over even short periods of time can clearly identify them. That may not be true of all newsgroups, but it is certainly true of some.
At least in some binaries newsgroups, there is a readily identifiable minority of regulars, who are clearly NOT spammers: they post frequently and/or in large amounts, and almost universally avoid spammy words in subjects. Then there's the rest, who post too infrequently or don't post enough to make their motives clear. In such newsgroups, even the formerly innocent requests for binary ID are now an invitation to spam, so simple word or phrase matching can't really vet even those infrequent posters. Still, the best and brightest of these groups stand out sharply. If we humans are able to pick them out with ease, why can't an algorithm?
If Newsbin were to log the quantity, type, quality, and frequency of what each poster submits over some period of time, clear patterns would begin to emerge, and those patterns would enable identifying at least the best that the newsgroup had to offer. If applied too rigidly, it would exclude new or infrequent but legit posters, but still the method is sound. It's sound because it's exactly what many of us are doing almost unconsciously already to make those same judgements ourselves.
I'm no longer enough of a programmer to express something like this process in contemporary code myself, but even I can visualize bits of the algorithm as I write this. I'm sure that one or a few good coders could do the idea justice. If that were implemented in Newsbin, it would place it an order of magnitude past its competition (and you know some spies are probably reading this, too).
Posted:
Sat Jun 11, 2005 10:23 am
by DThor
You're talking about determining patterns literally just from a Subject Line, and perhaps a "From" line. Problem is, that's not a lot of data to go on. When you look at the money being spent out there (and it's a *lot*) at corporations and other places just to try to keep email spam in check, when this often involves implementing restrictions on the end user *and* they have the luxury of grabbing the body of the message, you're asking for quite the challenge. Sporging is specific... it's often used to target particular groups and certain types of posts, so they'll interject their stuff among the "good" stuff (although that's getting harder to measure nowadays). Spam OTOH is a huge, fat fly swatter. That adds to the challenge.
I'm not saying it's impossible, it's just that when you look at the demographic (email users vs usenet) and the challenge (usenet tends to be more focused attacks than large scale swamping), I think you'll find it's a lot of work for not a lot of people, who are notoriously cheap.
There's an incentive for someone to come up with clever spam-killers (and they have) because *everyone* uses email. Same can't be said for usenet.
DT
Posted:
Sat Jun 11, 2005 5:51 pm
by macraig
No, I wasn't talking about analyzing nothing more than the subject line. You might understand better if you read my fourth paragraph in particular again.
Perfecting the algorithm for this would require some effort, but what doesn't? A regular-expression parser isn't trivial code either, and what I'm suggesting promises to have at least as much utility to Newsbin users as it does.
Hey, even if you think the automatic algorithm is impossible, a manually created and managed poster whitelist could still be an enormous help.
Posted:
Sun Jun 12, 2005 9:21 am
by DThor
[EDIT] Sorry, nasties removed in response to nasties removed...
Sporging and spam come from constantly changing email addresses. That's part of the process. In the meantime, you can write a script to find heuristic patterns in email generation programs and use that to generate entries in the Lockout Posters database, but I sure won't use it. Same goes for someone else's idea of a bad poster list. I'd rather use Computer Number 1 for that, thank you. If I find the same sort of spam appearing from the same poster (if I notice it - rarely), then I just throw it on the list.
And hey - just to be clear - I wasn't intending to totally shoot down the idea, I was just pointing out the challenge, which your post didn't seem to acknowledge.
DT
Posted:
Sun Jun 12, 2005 10:03 am
by Quade
Yeah, MC you've got an attitude. I noticed it in another post of yours, the faq entry one where Smite tells you the answer and you attacked him. It's one of the reasons I'm not getting involved in this discussion.
While I find what you're saying interesting, the message is being overshadowed by your fairly nasty reply to the only person seeking to engage you in discussion. Perhaps, you should have clarified what you said instead of attacking.
You've already alienated Smite who, while I don't see eye to eye with him, knows his shit and you're on the way to doing the same thing with Dthor who likewise has opinions I respect. It really didn't have to be that way either.
Posted:
Mon Jun 13, 2005 6:30 am
by macraig
Quade:
If your goal is to encourage constructive conversation, is deliberately dragging a thread off into a personal denunciation the proper way to do it? However "overshadowed" this conversation was by my smart-assed correction of DThor's reading of what I said, your intrusion into it has fairly doomed it. DThor did in fact completely miss some of what I said, and I tried to point it out with a nudge in the ribs. I didn't think that constituted excessive force; in any case, your reply certainly did. DThor still doesn't seem to understand what I was suggesting, and I get frustrated when what I think is perfectly clear and articulate English is being misread; when I get frustrated and impatient, I sometimes get smart-assed.
Fine, so I misinterpreted Smite's reply in another thread - not at all related to this one - and thought he was being condescending when he was simply being terse/lazy. My mistake, mostly. What's wrong with you that you can't separate one thread from another and a person's personality from intellect? Why can't you consider the merit of the ideas separate from personality? Lord knows the IT world is full of people who fit somewhere on the autistic spectrum (and I'm one of them), so expecting all of them to have perfect social skills and charming personality is going to leave you pretty frustrated.
Publicly ignoring a well-intentioned suggestion and burying the conversation with a very personal and insulting tangent seems like a pretty poor way to handle the situation, even if my personality or social skills could use improvement. Even if you're frustrated by someone's personality, is that really a good enough excuse to dismiss them entirely out of hand, especially if it's clear they're trying to rise above it and contribute something?
I see no further use for this thread than to fight overreaction with equal and opposite reaction. To anyone expecting further refinement of the ideas, I apologize. Maybe someone can remember them and suggest them in their own name and with their own words six months from now.
Posted:
Mon Jun 13, 2005 7:59 am
by itimpi
I HAVE read the various posts, and I cannot see any reasonable algorithm that could support this. I'm not saying that it is impossible - just that it would be extremely hard to come up with something that does the job.
For most people it will probably become a moot point anyway with the move towards using indexing sites to avoid downloading headers. It then becomes the responsibility of such sites to manage what material gets filtered.
Posted:
Mon Jun 13, 2005 8:49 am
by Quade
I actually see how I could implement something along these lines. I wouldn't implement it as a spam filter; it would in fact be a learning filter that determines what you like to download and allows you to show only that. So effectively it's the same thing, one that looks only at the subject field and your download history.
http://spambayes.sourceforge.net/
Another check box on the filter bar.
The problem is, the subject, which is the only thing you can really work with, is somewhat limited. Maybe roll the from field in there too.
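A minimal sketch of the kind of learning filter described here, assuming a plain naive Bayes model over subject words with the From field added as one extra token. This is not SpamBayes or Newsbin code; the class name, the "wanted"/"skipped" labels, and the tokenizer are all made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(subject, poster=""):
    """Lowercase word tokens from the subject, plus the From field as one extra token."""
    tokens = TOKEN_RE.findall(subject.lower())
    if poster:
        tokens.append("from:" + poster.lower())
    return tokens


class SubjectFilter:
    """Tiny naive Bayes model over subject tokens (hypothetical, not Newsbin code)."""

    def __init__(self):
        # "wanted" = posts the user downloaded, "skipped" = posts the user ignored.
        self.counts = {"wanted": Counter(), "skipped": Counter()}
        self.totals = {"wanted": 0, "skipped": 0}

    def train(self, label, subject, poster=""):
        self.counts[label].update(tokenize(subject, poster))
        self.totals[label] += 1

    def score(self, subject, poster=""):
        """Log-odds that a post is wanted; positive means it resembles past downloads."""
        vocab = len(set(self.counts["wanted"]) | set(self.counts["skipped"])) or 1
        n_wanted = sum(self.counts["wanted"].values())
        n_skipped = sum(self.counts["skipped"].values())
        log_odds = math.log((self.totals["wanted"] + 1) / (self.totals["skipped"] + 1))
        for tok in tokenize(subject, poster):
            # Add-one smoothing so unseen tokens do not zero out the estimate.
            p_wanted = (self.counts["wanted"][tok] + 1) / (n_wanted + vocab)
            p_skipped = (self.counts["skipped"][tok] + 1) / (n_skipped + vocab)
            log_odds += math.log(p_wanted / p_skipped)
        return log_odds
```

The check box on the filter bar would then just hide anything whose score falls below some cutoff. With so few tokens per subject the scores will be noisy, which is the limitation mentioned above.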
As for the other thing, the simple answer is to re-read your own posts and remove any kind of personal attack. I've tried to make it clear, it irks me. Why you do it, doesn't matter. Just don't do it and I'll be happy. I edit out nasty comments in my own posts all the time, there's no reason you can't do it too.
Posted:
Mon Jun 13, 2005 11:31 am
by macraig
I thought an automatic "whitelist" could be implemented by tracking the behavior of posters over short periods of time. By tracking their name (and its structure) along with how many "sets" they post and how large those sets are - how many files and the size of those files - in a given period of time, say a week or month, an algorithm might begin to pick out the regular contributors to a newsgroup. Spammers (sporging is a new term to me) almost always post shorter sets of smaller files, and almost always use subjects that differ in wording from the regulars, with exceptions. It would learn, as Quade said, by watching the patterns, the relationships between these different factors.
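A rough sketch of that tracking idea, assuming each observed set has already been reduced to a (poster, day, file count, byte count) tuple. The thresholds and every name here are hypothetical and would need tuning per newsgroup; nothing below reflects how Newsbin stores headers.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PosterStats:
    sets: int = 0
    files: int = 0
    total_bytes: int = 0
    days_active: set = field(default_factory=set)


def collect_stats(posts):
    """posts: iterable of (poster, day, files_in_set, bytes_in_set) tuples."""
    stats = defaultdict(PosterStats)
    for poster, day, files, size in posts:
        s = stats[poster]
        s.sets += 1
        s.files += files
        s.total_bytes += size
        s.days_active.add(day)
    return stats


def looks_regular(s, min_sets=3, min_days=2, min_avg_files=10, min_avg_bytes=5_000_000):
    """Hypothetical thresholds: enough sets, on enough days, of large enough size."""
    return (
        s.sets >= min_sets
        and len(s.days_active) >= min_days
        and s.files / s.sets >= min_avg_files
        and s.total_bytes / s.sets >= min_avg_bytes
    )


def auto_whitelist(posts, **thresholds):
    """Return the set of posters whose recent behavior looks like a regular's."""
    return {p for p, s in collect_stats(posts).items() if looks_regular(s, **thresholds)}
```

Fixed cutoffs like these will exclude new or infrequent but legit posters, as noted earlier in the thread, so a real version would probably weigh the factors in combination instead of requiring all of them.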
That's not very detailed, because I still have only a vaguely formed idea. The basic point, though, is that when a human first begins using binaries newsgroups he knows nothing of how to discriminate the good from the bad posts, yet in spite of that he'll quickly learn exactly how to do it, using the factors mentioned above and perhaps a few others. This can happen in the span of a week or month, maybe even a few days. He learns by recognizing the relationships between good and bad posts and those factors, not so much alone as in combination. That might even be a crude definition of Bayesian analysis. Unlike what Quade said about it learning specifically what one person likes, I think it would learn very broadly what all non-spammers like - a general feel for what is "good".
More often than not, for instance, I can easily tell that a series of posts is from a regular poster to the group and not a spammer, but *not* with certainty whether it's something I want to retrieve, mostly because too little detail is offered in the subject or I can't recognize what's being offered. In such cases (and especially with broadband) I have two choices: (1) sample the series and mark it as read if I don't want it, or (2) simply retrieve it anyway knowing at least that it's not spam, and simply delete the files if they're not something I want. That way at least they're (still) in the signature cache, so future encounters with the same files will be ignored; if I were to mark the series as read and skip it, I may encounter it again in the future and have to rediscover the same choice again.
That's what I envisioned this algorithm being able to do: pick out and identify the regular posters, whitelist them perhaps, and download all posts from them... not necessarily only the ones I might personally want. That's less work and could still be an improvement over trying to write regexp filters and use poster lockouts to do the same. I can't code it, but I'll bet somebody else can.
Maybe Itimpi is right that the new (NZB?) indexing will become the way things are done to avoid spam. I haven't spent the time yet to investigate and understand what that's all about, so my suggestion was born of more traditional practices. If these NZB files and indexing prove to be a better method, I'll vote for that instead and against what I've suggested. Solving the problem is all that matters, not who or what solved it.
Posted:
Mon Jun 13, 2005 12:02 pm
by Quade
I have no problem with a whitelist. It makes sense to me as a first step. Same thing, some kind of quick checkbox so you can see it with and without.
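As a sketch of that first step, assuming posts are available as simple records with a poster field: a manual whitelist plus one flag standing in for the quick checkbox. The function and parameter names are invented for illustration.

```python
def visible_posts(posts, whitelist, whitelist_only=True):
    """Filter a post list by poster address.

    posts: list of dicts with a 'poster' key.
    whitelist: iterable of poster addresses the user trusts.
    whitelist_only: mirrors the proposed checkbox; False shows everything.
    """
    if not whitelist_only:
        return list(posts)
    allowed = {addr.lower() for addr in whitelist}
    return [p for p in posts if p["poster"].lower() in allowed]
```

Flipping whitelist_only lets you compare the group with and without the filter without touching the list itself.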
As an evolution of usenet, you might remove the need to even have a group as such. Most of the large groups are just dumping grounds for random crap anyway. Then the servers become large pools of posts that can't be found other than if you already know they're there (NZB's for example). Then NZB's become the .torrents of usenet.
You could actually do that now by posting to a random assortment of groups so, the files aren't complete in any one group but, are spread out over 50 groups. NZB's would still extract them as would any program that can scan 50 groups for matching posts.
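A sketch of the "scan 50 groups for matching posts" half of that, assuming headers arrive as (poster, subject, message-id) tuples per group and that subjects end in the usual "(part/total)" counter. The function name and data layout are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Matches subjects ending in a part counter such as "file.bin (2/3)".
PART_RE = re.compile(r"^(?P<name>.*)\((?P<part>\d+)/(?P<total>\d+)\)\s*$")


def assemble(headers_by_group):
    """Merge headers from many groups and return only the complete files.

    headers_by_group: {group_name: [(poster, subject, message_id), ...]}
    Returns {(poster, name): {part_number: message_id}} for files whose
    parts 1..total were all found somewhere across the scanned groups.
    """
    parts = defaultdict(dict)
    totals = {}
    for headers in headers_by_group.values():
        for poster, subject, message_id in headers:
            m = PART_RE.match(subject)
            if not m:
                continue  # no part counter, cannot be grouped this way
            key = (poster, m.group("name").strip())
            parts[key][int(m.group("part"))] = message_id
            totals[key] = int(m.group("total"))
    return {
        key: found
        for key, found in parts.items()
        if set(found) == set(range(1, totals[key] + 1))
    }
```

This only works while the subject carries a usable name and counter, so it stops helping at exactly the point where the subject no longer needs to be meaningful.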
If you take that to the next step, the subject doesn't even need to be meaningful. Then nobody who doesn't have the NZB can tell what's been posted or how to group the posts into files.
Posted:
Mon Jun 13, 2005 12:10 pm
by DThor
Sporging is different than spam because it's literally an attack, not a method to separate you from your money. Therefore, it's specifically designed to make perusing the groups a PITA. Of course, there's plenty of spam, too.
Part of the problem is if you look at a lot of posts, they'll have some default user name like "poster@sender.com". You don't want to necessarily add that to a whitelist because a lot of people post legitimately with the same addy. Add to that that one person's spam is another person's pr0n, and you get into a difficult area with using addys to deny loading. For example, you might see a bunch of pr0n piccies from "buxombabe.com" that you perceive as spam because they're trying to get you to go to the site and spend money, but plenty of other users think it's a great download.
It can be useful, of course, and there already exists a method to build your own whitelist, but really you're talking about post-processing that list to generate something useful. I guess my point is that most of the systems out there now for doing this (and there are several) would *especially* need lots of human intervention for these reasons.
Frankly, this is something that wouldn't take a helluva lot of work to prototype - no need to do anything fancy right in newsbin yet - just something that processes all the addys and subjects out there - pass them through some of the various systems out there like Spam Assassin, and see what comes out the other end. Personally I'm a little mixed on the success of that, which is why I'm not going to volunteer, but if you're game, I'd be curious.
DT
Posted:
Mon Dec 05, 2005 11:38 pm
by alphadog
What would be cool would be for Newsbin to host a centralized database where all owners of the program could "vote" a post into various categories. Much like the Cloudmark email spam tool. Now that would be a cool value-add feature. I've been using Cloudmark for my Outlook spam filter and it is quite good...
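A minimal sketch of how such a vote tally might work, assuming posts are keyed by message-id and each user gets one vote per post. This is an in-memory stand-in for the shared database, not a description of how Cloudmark or Newsbin works; the class, the min_votes cutoff, and the category names are hypothetical.

```python
from collections import Counter, defaultdict


class VoteTally:
    """In-memory stand-in for the shared database (hypothetical design)."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes
        self.votes = defaultdict(dict)  # message_id -> {user_id: category}

    def vote(self, user_id, message_id, category):
        # One vote per user per post; a later vote replaces the earlier one.
        self.votes[message_id][user_id] = category

    def category(self, message_id):
        """Most-voted category, or None until min_votes users have voted."""
        ballots = self.votes.get(message_id, {})
        if len(ballots) < self.min_votes:
            return None
        return Counter(ballots.values()).most_common(1)[0][0]
```

Requiring a minimum number of voters is one cheap guard against a single account labeling everything; a real service would need more than that.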