filter for something@something.something
Posted:
Tue Feb 14, 2017 11:41 am
by sly001
Can anyone write a regex to match any email address where the three parts repeat, like something@something.something or abc77@abc77.abc77 or e74514a39@e74514a39.e74514a39. Each part can consist of upper/lower letters and numerals and be of any length. The only constants are the @ and the dot.
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 4:35 pm
by tl
sly001 wrote:Can anyone write a regex to match any email address where the three parts repeat, like something@something.something or abc77@abc77.abc77 or e74514a39@e74514a39.e74514a39. Each part can consist of upper/lower letters and numerals and be of any length. The only constants are the @ and the dot.
Newsbin uses PCRE regexps with numbered capture groups disabled.
Based on an online regexp tester, I think this should do what you want:
(?<first>[A-Za-z0-9]+)@\k<first>\.\k<first>
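For anyone who wants to try the idea outside Newsbin, here is a minimal sketch in Python. It assumes Python's `re` module, which spells named groups as `(?P<first>...)` and named backreferences as `(?P=first)` rather than the PCRE forms used above. The function name `has_repeated_address` is made up for illustration.

```python
import re

# Same idea as the PCRE pattern above, in Python's named-group syntax:
# capture the first part once, then require the identical text twice more.
REPEATED = re.compile(r"(?P<first>[A-Za-z0-9]+)@(?P=first)\.(?P=first)")


def has_repeated_address(text: str) -> bool:
    """Return True if text contains an x@x.x run built from one repeated part."""
    # search() is unanchored, so the address may sit inside a longer string.
    return REPEATED.search(text) is not None


if __name__ == "__main__":
    for sample in (
        "abc77@abc77.abc77",
        "e74514a39@e74514a39.e74514a39",
        "alice@example.com",
    ):
        print(sample, has_repeated_address(sample))
```

Because the pattern is unanchored, it also fires when the repeated address is only part of the poster field, which is how a header filter would normally see it.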
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 5:25 pm
by sly001
Awesome. That seems to work. Thank you.
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 5:50 pm
by sly001
Maybe I spoke too soon. It seemed to work, but not as a filter to reject the recent influx of spam postings. Here's what I did:
1 - Created new filter set to REJECT if POSTER contains (?<first>[A-Za-z0-9]+)@\k<first>\.\k<first>
2 - Add that as a Header Filter to the Unsorted group
But posts with the poster like 0a86be9a6 <0a86be9a6@0a86be9a6.0a86be9a6> still got through, even though this should have been caught by this filter. I can create a filter that says to ACCEPT and I can see the posts are filtered to show only these posters. So it works to accept them, but not as a reject pre-database write. Any ideas?
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 6:02 pm
by dexter
The only way to effectively handle this pattern is with backreferences. Unfortunately they seem to be disabled in Newsbin. Quade has an item on his list to look into this. If we can get backreferences enabled then this is the RE you would use:
([0-9a-z]+)[ ]\<\1\@\1\.\1\>
There is a space in the square brackets to catch the space before the email portion.
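As a rough check of that RE outside Newsbin, the sketch below runs it with Python's `re` module, which does support numbered backreferences. The `<`, `@` and `>` are written unescaped because they are ordinary characters in Python's dialect, and `re.IGNORECASE` stands in for Newsbin's case-insensitive matching. The name `is_spam_poster` is hypothetical.

```python
import re

# \1 must repeat exactly what group 1 captured: name, local part, domain, TLD.
POSTER_SPAM = re.compile(r"([0-9a-z]+)[ ]<\1@\1\.\1>", re.IGNORECASE)


def is_spam_poster(poster: str) -> bool:
    """Return True for poster fields shaped like 'x <x@x.x>'."""
    return POSTER_SPAM.search(poster) is not None


if __name__ == "__main__":
    print(is_spam_poster("0a86be9a6 <0a86be9a6@0a86be9a6.0a86be9a6>"))
    print(is_spam_poster("someone <someone@example.com>"))
```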
Other than that, if the repeating portions are all the same length, 9 characters in your example, you could do this:
[0-9a-z]{9}[ ]\<[0-9a-z]{9}\@[0-9a-z]{9}\.[0-9a-z]{9}\>
You don't need to include A-Z because RE's in Newsbin are case insensitive.
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 6:08 pm
by sly001
They are not always 9 characters. I've seen them vary from 6 to 12 - but really they can be any length. Guess I just need to wait for a NB update where backreferences are enabled.
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 8:37 pm
by dexter
If they are 6-12, you could do:
[0-9a-z]{6,12}[ ]\<[0-9a-z]{6,12}\@[0-9a-z]{6,12}\.[0-9a-z]{6,12}\>
Re: filter for something@something.something
Posted:
Tue Feb 14, 2017 9:38 pm
by sly001
Thank you. In trying to future-proof this:
- What is the range I can use? Can I do from 1-99 characters with {1,99} instead of {6,12}?
- Is this currently case-insensitive, or would I need to change it to [0-9a-zA-Z]?
- Is it possible to include special characters in addition to letters/numerals?
Re: filter for something@something.something
Posted:
Wed Feb 15, 2017 1:40 pm
by Quade
1 - yes - To me {1,99} is kinda pointless. If you mean "all" then ".*" is probably better. The power of the curly braces is being able to set minimum and maximum ranges. The spam has at least N characters of numbers/letters with no spaces, so you're better off setting a minimum threshold for length: some size smaller than the maximum but as large as possible so you don't catch too much. When filtering headers, if you filter too much you'll just lose records and won't really even know you lost them.
2 - Not case sensitive.
3 - yes, but keep in mind that some characters have to be escaped.
[] needs to be escaped as \[\]
for example.
Here is the list of characters that need to be escaped to use them as normal literals:
Opening square bracket [
Backslash \
Caret ^
Dollar sign $
Period or dot .
Vertical bar or pipe symbol |
Question mark ?
Asterisk or star *
Plus sign +
Opening round bracket ( and the closing round bracket )
These special characters are often called "metacharacters".
Found:
http://stackoverflow.com/questions/1296 ... -net-regex
Keep in mind with regexes you only have to match on some minimum. You don't need to match the whole string.
If you reject a 40-90 character run of numbers and letters with no spaces, that's enough to block this spam.
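Both points (escaping metacharacters, and matching only a minimum) can be sketched in Python. This assumes Python's `re` module rather than Newsbin's engine, and the helper names `contains_literal` and `has_long_run` are invented for the example.

```python
import re


def contains_literal(literal: str, text: str) -> bool:
    """Search for literal as plain text, escaping any metacharacters in it."""
    # re.escape("[]") yields \[\], so the brackets stop meaning "character class".
    return re.search(re.escape(literal), text) is not None


def has_long_run(text: str, min_len: int) -> bool:
    """Return True if text contains at least min_len letters/digits in a row."""
    # search() only needs a matching substring; the rest of text is ignored.
    pattern = rf"[0-9a-z]{{{min_len},}}"
    return re.search(pattern, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(re.escape("[]"))
    print(contains_literal("[]", "empty brackets [] here"))
    print(has_long_run("x " + "a1" * 20, 40))
```

The minimum length is a parameter here because the right threshold depends on what the spam actually looks like.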
Re: filter for something@something.something
Posted:
Wed Feb 15, 2017 4:05 pm
by sly001
So maybe I'm misunderstanding. This regex:
[0-9a-z]{6,12}[ ]\<[0-9a-z]{6,12}\@[0-9a-z]{6,12}\.[0-9a-z]{6,12}\>
catches this:
spam123 <spam123@spam123.spam123>
But will it also catch this?
notspam <6chars@gooddomain.abc123>
If so, then it's not what I want. I need to match repeated phrases, like where spam123 is used in each 'block'. That sounds like backreferences, which are currently unsupported. So what do you suggest is the best way to filter out this spam, where the poster email address is continually changing yet follows the pattern of something@something.something?
Re: filter for something@something.something
Posted:
Wed Feb 15, 2017 4:19 pm
by dexter
Yeah, it will match "notspam <6chars@gooddomain.abc123>". That's why I said the best solution would be if Newsbin supported back-references.
Until that happens, the only other way would be to find some pattern in the subject that is common to all these posts.
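To make the difference concrete, here is a small Python comparison of the length-range RE against the backreference RE from earlier in the thread. It assumes Python's `re` module with `re.IGNORECASE` standing in for Newsbin's case-insensitive matching. The function name `classify` is hypothetical.

```python
import re

# Length-only pattern: each part just has to be 6-12 letters/digits.
RANGE_ONLY = re.compile(
    r"[0-9a-z]{6,12}[ ]<[0-9a-z]{6,12}@[0-9a-z]{6,12}\.[0-9a-z]{6,12}>",
    re.IGNORECASE,
)
# Backreference pattern: every part must repeat the first one exactly.
WITH_BACKREF = re.compile(r"([0-9a-z]+)[ ]<\1@\1\.\1>", re.IGNORECASE)


def classify(poster: str) -> tuple[bool, bool]:
    """Return (range_only_hit, backreference_hit) for one poster field."""
    return (
        RANGE_ONLY.search(poster) is not None,
        WITH_BACKREF.search(poster) is not None,
    )


if __name__ == "__main__":
    for poster in (
        "spam123 <spam123@spam123.spam123>",
        "notspam <6chars@gooddomain.abc123>",
    ):
        print(poster, classify(poster))
```

The range pattern flags both posters, while the backreference pattern flags only the one whose parts really repeat.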
Re: filter for something@something.something
Posted:
Wed Feb 15, 2017 4:32 pm
by sly001
damn
Re: filter for something@something.something
Posted:
Wed Feb 15, 2017 5:41 pm
by Quade
sly001 wrote:catches this:
spam123 <spam123@spam123.spam123>
You know filtering out email addresses is failure prone. You're better off just filtering in either the posters you like or the subjects you like ("\[FULL\]" for example)
Many spammers are using random posting fields so it's impossible to match them all. Better to filter IN what you like than trying to filter OUT what you don't like.
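The "filter IN" approach could look roughly like the Python sketch below. This is not how Newsbin implements its filters. The poster address, the sample subjects and the name `should_keep` are all placeholders made up for illustration; only the `\[FULL\]` subject pattern comes from the post above.

```python
import re

# Subjects worth keeping; the brackets are escaped so they match literally.
WANTED_SUBJECT = re.compile(r"\[FULL\]", re.IGNORECASE)
# Hypothetical allow-list of posters you trust.
WANTED_POSTERS = {"trusted.poster@example.com"}


def should_keep(poster: str, subject: str) -> bool:
    """Keep a header only if the poster or the subject is explicitly wanted."""
    if poster in WANTED_POSTERS:
        return True
    return WANTED_SUBJECT.search(subject) is not None


if __name__ == "__main__":
    print(should_keep("0a86be9a6@0a86be9a6.0a86be9a6", "random junk"))
    print(should_keep("0a86be9a6@0a86be9a6.0a86be9a6", "Some Release [FULL] 01/20"))
```

With an allow-list, a spammer changing the poster address on every post makes no difference, because nothing is kept unless it matches something you asked for.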