While developing the Spelling Plus Library, and more recently while adding multilingual support to it, I discovered two serious bugs in how the Regular Expression implementation in ActionScript handles accented characters.
First, RegExp in AS3 does not include accented characters in the word character class. For example, the pattern /\w+/ (match one or more word characters) matches “r” and “sume” in “résume”, when it should match the full string. UPDATE: Arthur has pointed out in the comments that this is correct according to the ECMAScript and POSIX RegEx specifications: \w is intended to match just the set [a-zA-Z0-9_], which it does in AS3. That said, it would be nice to have support for Unicode property sets (which allow you to match word characters in any language, among other things), but I can understand that this might have an unacceptable impact on the size of the Flash Player.
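To make the difference concrete, here is a minimal sketch in TypeScript rather than AS3. It shows the spec-conforming \w behaviour next to a Unicode-aware pattern. The function names asciiWords and unicodeWords are made up for illustration, and the \p{L} / \p{N} property escapes are an ES2018 feature that I am not claiming is available in AS3.

```typescript
// Per ECMAScript, \w is exactly [A-Za-z0-9_], so an accented letter ends the match.
function asciiWords(text: string): string[] {
  return text.match(/\w+/g) || [];
}

// Unicode-aware alternative: \p{L} (any letter) and \p{N} (any number).
// Property escapes require the "u" flag; the pattern is built from a string
// so that older compiler targets do not reject the regex literal.
function unicodeWords(text: string): string[] {
  const pattern = new RegExp("[\\p{L}\\p{N}_]+", "gu");
  return text.match(pattern) || [];
}

// "\u00e9" is the precomposed character é.
console.log(asciiWords("r\u00e9sume"));   // [ 'r', 'sume' ]
console.log(unicodeWords("r\u00e9sume")); // [ 'résume' ]
```

Note that this only handles precomposed characters. A string that stores é as "e" plus a combining accent would also need \p{M} in the class.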
Secondly, there is a somewhat obscure problem with how the Flash Player matches accented characters against \S. Specifically, it appears to miscount accented characters when matching them to \S, which produces incorrect matches. This is not the case with the negated whitespace character set [^\s], although the two should behave identically in RegEx. The issue is easier to show than to describe, so here are a few examples:
- the pattern /\S+/ (one or more not-whitespace chars) will match the full string of “é aé”, when it should match “é” and “aé” separately.
- the same pattern /\S+/ will match “aé” and “bé” correctly for the string “aé bé”.
- the pattern /\S{2,}/ (two or more not-whitespace chars) will match the full string “aé bcé” when it should match “aé” and “bcé”.
- the same pattern /\S{2,}/ will only match “bcé” for the string “éa bcé”, when it should match “éa” and “bcé”.
All of the above work properly if you substitute [^\s] for \S.
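For reference, here is a small TypeScript sketch of what a conforming engine returns for the four cases above. It does not reproduce the Flash Player bug. It only shows the expected results, and that \S and [^\s] agree. The helper names runsWithEscape and runsWithNegatedClass are hypothetical.

```typescript
// Collect every run of at least `minLength` non-whitespace characters using \S.
function runsWithEscape(text: string, minLength: number = 1): string[] {
  const pattern = new RegExp("\\S{" + minLength + ",}", "g");
  return text.match(pattern) || [];
}

// The same query written with the negated whitespace class [^\s].
function runsWithNegatedClass(text: string, minLength: number = 1): string[] {
  const pattern = new RegExp("[^\\s]{" + minLength + ",}", "g");
  return text.match(pattern) || [];
}

// [input, minimum run length]; "\u00e9" is the precomposed character é.
const cases: Array<[string, number]> = [
  ["\u00e9 a\u00e9", 1],   // "é aé"   -> ["é", "aé"]
  ["a\u00e9 b\u00e9", 1],  // "aé bé"  -> ["aé", "bé"]
  ["a\u00e9 bc\u00e9", 2], // "aé bcé" -> ["aé", "bcé"]
  ["\u00e9a bc\u00e9", 2], // "éa bcé" -> ["éa", "bcé"]
];

for (let i = 0; i < cases.length; i++) {
  const [text, minLength] = cases[i];
  const viaEscape = JSON.stringify(runsWithEscape(text, minLength));
  const viaClass = JSON.stringify(runsWithNegatedClass(text, minLength));
  // Both columns should be identical on every row.
  console.log(JSON.stringify(text), viaEscape, viaClass);
}
```

Running the same four inputs through both patterns in AS3 and comparing against these results is a quick way to check whether a given Flash Player build is affected.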
Hopefully this is helpful for other people working with RegExp, especially with languages other than English. It is quite frustrating to work around – I ended up writing a specialized character lexer instead of using RegExp in SPL.
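To give a rough idea of what a RegExp-free approach can look like, here is a minimal TypeScript sketch of a character-by-character splitter. This is not the SPL lexer. The name splitOnWhitespace and the short whitespace list are assumptions made for the example, and a real implementation would need a fuller whitespace set and proper word-boundary rules.

```typescript
// Assumed whitespace set for this sketch: space, tab, newline, carriage return,
// form feed, vertical tab, and non-breaking space.
const WHITESPACE_CHARS = " \t\n\r\f\v\u00a0";

// Walk the string one character at a time and collect runs of
// non-whitespace characters that are at least `minLength` long.
function splitOnWhitespace(text: string, minLength: number = 1): string[] {
  const tokens: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (WHITESPACE_CHARS.indexOf(ch) !== -1) {
      // A whitespace character closes the current run.
      if (current.length >= minLength) {
        tokens.push(current);
      }
      current = "";
    } else {
      current += ch;
    }
  }
  // Flush the final run, if any.
  if (current.length >= minLength) {
    tokens.push(current);
  }
  return tokens;
}

console.log(splitOnWhitespace("\u00e9 a\u00e9"));      // [ 'é', 'aé' ]
console.log(splitOnWhitespace("\u00e9a bc\u00e9", 2)); // [ 'éa', 'bcé' ]
```

Because it never hands the text to the regular expression engine, a loop like this sidesteps the \S problem entirely, at the cost of maintaining the character tables by hand.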
Know of any other RegExp bugs in AS3? Share them in the comments.