April 29, 2008 / February 2, 2015 by Grant Skinner

RegExp Bugs With Accented Characters

While developing the Spelling Plus Library, and more recently while adding multilingual support to it, I discovered two serious bugs in how the regular expression implementation in ActionScript handles accented characters.

First, RegExp in AS3 does not include accented characters in the word character class. For example, the pattern /\w+/ (match one or more word characters) matches “r” and “sume” in “résume”, when it should match the full string.

UPDATE: Arthur has pointed out in the comments that this is correct according to the ECMAScript and POSIX RegEx specifications. \w is intended to match just the set [a-zA-Z0-9_], which it does in AS3. With that understood, it would still be nice to have support for Unicode property sets (which let you match word characters in any language, among other things), but I can understand that this may have an unacceptable impact on the size of the Flash Player.
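To make the \w behaviour concrete, here is a minimal sketch. It is written in TypeScript rather than ActionScript, on the assumption that any engine following the ECMAScript definition of \w produces the same matches. The LATIN1_WORD pattern is a hypothetical workaround that only covers the accented letters of the Latin-1 Supplement block; it is not what SPL uses.

```typescript
// "\u00E9" is "é", written as an escape so the input is unambiguous.
const sample: string = "r\u00E9sume"; // "résume"

// \w is defined as [A-Za-z0-9_], so the accented letter splits the word.
const asciiWord: RegExp = /\w+/g;
const asciiMatches: string[] = sample.match(asciiWord) || [];
// asciiMatches is ["r", "sume"]

// Hypothetical workaround: extend \w with the letters in U+00C0-U+00FF,
// skipping the multiplication sign (U+00D7) and division sign (U+00F7).
const LATIN1_WORD: RegExp = /[\w\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/g;
const extendedMatches: string[] = sample.match(LATIN1_WORD) || [];
// extendedMatches is ["résume"]
```

An explicit range like this has to be widened by hand for every additional script, which is exactly the gap that Unicode property sets would fill.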

Second, there is a somewhat obscure problem with how the Flash Player matches \S against accented characters. Specifically, it appears to miscount accented characters when matching them with \S, which produces incorrect matches. The negated whitespace character set [^\s] does not have this problem, although the two should behave identically in RegEx. This issue is pretty weird, so I’ll give a few examples:

the pattern /\S+/ (one or more not-whitespace chars) will match the full string of “é aé”, when it should match “é” and “aé” separately.
the same pattern /\S+/ will match “aé” and “bé” correctly for the string “aé bé”.
the pattern /\S{2,}/ (two or more not-whitespace chars) will match the full string “aé bcé” when it should match “aé” and “bcé”.
the same pattern /\S{2,}/ will only match “bcé” for the string “éa bcé”, when it should match “éa” and “bcé”.

All of the above work properly if you substitute [^\s] for \S.
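For reference, the sketch below records the matches that should come back for the four cases above, using both \S and [^\s]. It is written in TypeScript rather than ActionScript and assumes an ECMAScript-conformant engine, where the two forms agree; allMatches is a hypothetical helper name. In the Flash Player, the \S results are the ones that would differ, as described above.

```typescript
// Hypothetical helper: collect every match of a global pattern.
function allMatches(pattern: RegExp, text: string): string[] {
  return text.match(pattern) || [];
}

// [\S form, [^\s] form, input]. "\u00E9" is "é".
const cases: Array<[RegExp, RegExp, string]> = [
  [/\S+/g, /[^\s]+/g, "\u00E9 a\u00E9"],         // "é aé"
  [/\S+/g, /[^\s]+/g, "a\u00E9 b\u00E9"],        // "aé bé"
  [/\S{2,}/g, /[^\s]{2,}/g, "a\u00E9 bc\u00E9"], // "aé bcé"
  [/\S{2,}/g, /[^\s]{2,}/g, "\u00E9a bc\u00E9"], // "éa bcé"
];

const results = cases.map(([shorthand, negated, text]) => ({
  text,
  shorthand: allMatches(shorthand, text),
  negated: allMatches(negated, text),
}));
// Expected for both forms, in order:
// ["é", "aé"], ["aé", "bé"], ["aé", "bcé"], ["éa", "bcé"]
```

Comparing the shorthand and negated arrays for each input is a quick way to check whether a given player version is affected.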

Hopefully this is helpful for other people working with RegExp, especially with languages other than English. It is quite frustrating to work around – I ended up writing a specialized character lexer instead of using RegExp in SPL.

Know of any other RegExp bugs in AS3? Share them in the comments.

12 Comments

Theo April 30, 2008 at 2:03am

A wild guess as to the problem with \S matching too many characters: it has a problem with some cases of multi-byte character runs, which wouldn’t be very surprising since regexps suck at non-ASCII on all systems I’ve used them in.

The regexp engine in Firefox seems to handle all the \S+ cases (although it has the same basic problem of \w not matching accented characters).
Grant Skinner April 30, 2008 at 8:50am

Theo,

Yes, this was my thought too. It’s not counting the multi-byte character correctly in this case for some reason. Matching the trailing space is a little strange as well, but is likely related to the same problem. My guess would be the counting problem causes it to skip trying to match the space character completely.
Arthur Debert April 30, 2008 at 8:52am

Hi Grant.

Regarding accented letters: while this is a bit counterintuitive, it’s actually part of the ECMA-262 spec. The \w character class is just a shortcut for the a-z, A-Z, 0-9 ranges plus “_”, which does not include accented letters.

Cheers

Arthur Debert

[1] The spec http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

[2] The POSIX regex spec : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html
Grant Skinner April 30, 2008 at 9:15am

Arthur,

Right you are – my bad. I guess the problem then is that AS3’s RegExp implementation does not include support for any extended character classes (e.g. Unicode property sets), though I can understand that this may be due to file size implications in the player.

I’ll update the article to reflect this.
Marcos Neves April 30, 2008 at 9:55am

I pointed out this bug a year ago, but no one listened to me. I hope they listen to you now!
Quantium April 30, 2008 at 11:03am

I live in México, and I have known about this bug since the first Flex SDK came out. It is now habitual practice to use more complicated RegExps to do anything with Spanish text.
Nikos Katsikanis October 9, 2008 at 2:29am

I think I’ve found another regex bug:

Any idea how

/^(.*)-(.*)$/

doesn’t find

aaaa – bbbb
Grant Skinner December 5, 2008 at 10:13am

Nikos – I just tested that pattern in RegExr, and it seems to work fine for me.
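One possible explanation for the mismatch, offered as an assumption rather than a diagnosis: the sample text as displayed in the comment contains an en dash (U+2013), which may simply be the blog’s typographic substitution, while the pattern looks for an ASCII hyphen (U+002D). A TypeScript sketch of the difference, where either is a hypothetical variant of the pattern that accepts both characters:

```typescript
const hyphenOnly: RegExp = /^(.*)-(.*)$/;

const withHyphen: string = "aaaa - bbbb";
const withEnDash: string = "aaaa \u2013 bbbb"; // "aaaa – bbbb"

// Matches: group 1 is "aaaa " and group 2 is " bbbb".
const hyphenResult = hyphenOnly.exec(withHyphen);

// null: the input contains no ASCII hyphen at all.
const enDashResult = hyphenOnly.exec(withEnDash);

// Hypothetical variant: a character class holding both dashes.
const either: RegExp = /^(.*)[-\u2013](.*)$/;
const eitherResult = either.exec(withEnDash);
```

If the original test string used a plain hyphen, this does not apply and the pattern should match as written.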
Eric Skogen September 3, 2009 at 10:50am

String#replace accepts a function as a second argument. The function receives the match, the index in the string where the match begins, and the entire string. But for Unicode characters, the index is wrong (or at least, not what I would expect!)

trace("_x_x".replace(/x/g, function(match, i, str) {
    trace(i);
    trace(str.charAt(i), str.charAt(i) === "x");
    return "_";
}));

trace("__".replace(//g, function(match, i, str) {
    trace(i);
    trace(str.charAt(i), str.charAt(i) === "");
    return "_";
}));
Eric Decker October 2, 2009 at 1:36pm

I’m not sure if it’s on the RegEx side or the TextField, but when getting the index position of a match it seems to be off if there are special characters such as em-dash or smart single quotes. As with the em-dash, it looks like it offsets the index position by 3 characters.
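One guess, not confirmed anywhere in this thread: an em dash (U+2014) takes three bytes in UTF-8, so an index counted in bytes rather than characters would drift past each one. The TypeScript sketch below shows how the two offsets diverge. utf8Length is a hypothetical helper that assumes well-formed surrogate pairs, and nothing here claims this is what the Flash Player actually does.

```typescript
// Hypothetical helper: number of bytes `text` occupies when encoded as UTF-8.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2; // e.g. "é"
    } else if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < text.length) {
      bytes += 4; // surrogate pair: one 4-byte code point
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // e.g. em dash, smart quotes
    }
  }
  return bytes;
}

const line: string = "a\u2014b x"; // "a—b x"
const charIndex: number = line.indexOf("x");                   // index in characters
const byteIndex: number = utf8Length(line.slice(0, charIndex)); // index in UTF-8 bytes
// Here charIndex is 4 and byteIndex is 6: the em dash accounts for the gap.
```

Printing both values next to the index reported by a match would show whether the reported position follows the character count or the byte count.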
