Members of "word character" set?

Posted: Sun Jan 21, 2007 2:00 pm
by bobkoure
Other than a-z, A-Z, 0-9, which characters are considered to be part of the "word char" ( \w ) set in the regex used by Newsbin?

I'm asking because, for instance, the underscore char '_' is part of the \w set, yet the underscore is often used as a separator char. So if you were looking for the artist "john doe", and were semi-clever with regex, you might use the filter "john\W*doe", but that misses "john_doe". So you use "john[\W_]*doe" (or "john(\W|_)*doe" if using a negated set in another set strikes you as weird).
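
The difference between the two filters can be checked with a short sketch. It uses Python's re module as a stand-in, which is an assumption: Newsbin's own regex engine may treat \w differently. The subject lines are made up for illustration.

```python
import re

# Hypothetical subject lines, invented for this illustration.
subjects = [
    "john doe",
    "john_doe",
    "john.doe",
    "john - doe",
    "johndoe",
    "johnny doe",
]

# Naive filter: any run of non-word characters between the names.
naive = re.compile(r"john\W*doe", re.IGNORECASE)

# Adjusted filter: non-word characters OR underscores between the names.
fixed = re.compile(r"john[\W_]*doe", re.IGNORECASE)

naive_hits = [s for s in subjects if naive.search(s)]
fixed_hits = [s for s in subjects if fixed.search(s)]

print("naive:", naive_hits)
print("fixed:", fixed_hits)
```

In this sketch the naive filter skips "john_doe" because the underscore counts as a word character, while the adjusted filter picks it up. Neither filter matches "johnny doe".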

So... I'm wondering what other characters might be part of \w - especially those that might be commonly used as separators.

... and thanks!
Bob

Posted: Sun Jan 21, 2007 11:01 pm
by FrizzleFry
According to the RE Cheat Sheet

\w
Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].

Posted: Mon Jan 22, 2007 4:26 pm
by bobkoure
Care to decipher that?
For instance, \p{P} is "punctuation", but \p{Pc} is...?
I assume you do know what the symbols you're quoting mean...
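
For reference, the two-letter codes are Unicode general categories: Ll, Lu, Lt and Lo are lowercase, uppercase, titlecase and other letters, Nd is decimal digits, and Pc is "connector punctuation", the category that contains the underscore. One way to see which characters fall under Pc is to list them with Python's unicodedata module. This is only a sketch: the helper name chars_in_category is made up here, and the result reflects the Unicode version bundled with the Python interpreter, which may not match the tables used by Newsbin's regex engine.

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return every code point whose Unicode general category equals `category`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


# Pc = "Punctuation, Connector", the only punctuation category in the quoted \w definition.
connectors = chars_in_category("Pc")

for ch in connectors:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The list is short and includes the underscore (LOW LINE) along with a few rarely seen connectors such as the undertie. Common separators like the hyphen, period and space are not in Pc, so under the quoted definition they fall outside \w.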