RegExp Bugs With Accented Characters

While developing the Spelling Plus Library, and more recently while adding multilingual support to it, I discovered two serious bugs in how ActionScript's regular expression implementation handles accented characters.

First, RegExp in AS3 does not include accented characters in the word character class. For example, the pattern /\w+/ (match one or more word characters) matches “r” and “sume” in “résume”, when I expected it to match the full string.

UPDATE: Arthur has pointed out in the comments that this is correct according to the ECMAScript and POSIX RegEx specifications. \w is intended to match only the set [a-zA-Z0-9_], which it does in AS3. With that understood, it would be nice to have support for Unicode property sets (which let you match word characters in any language, among other things), but I can understand that this might have an unacceptable impact on the size of the Flash Player.
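The \w behaviour described above can be reproduced in any ECMAScript engine. The following is a minimal sketch in TypeScript rather than ActionScript 3, chosen only so it can be run outside the Flash Player. The helper name wordRuns and the widened character class are illustrative assumptions, not something from the original post, and the Latin-1 ranges cover Western European accented letters only. Whether the same class behaves identically in the Flash Player is not tested here.

```typescript
// Collect every run matched by a global pattern (empty array when nothing matches).
function wordRuns(pattern: RegExp, text: string): string[] {
  return text.match(pattern) ?? [];
}

// "résume" as spelled in the post; é is written as an escape to avoid encoding surprises.
const sample = "r\u00E9sume";

// \w is only [A-Za-z0-9_], so the accented letter splits the word in two.
const asciiOnly = wordRuns(/\w+/g, sample);

// Assumed workaround: widen the class with the Latin-1 letter ranges
// (À-Ö, Ø-ö, ø-ÿ), skipping the × and ÷ signs that sit between them.
const withAccents = wordRuns(/[\w\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/g, sample);

console.log(asciiOnly);   // ["r", "sume"]
console.log(withAccents); // ["résume"]
```

An explicit range like this is a stopgap: it has to be extended by hand for every additional script, which is exactly the work that Unicode property sets would remove.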

Secondly, there is a somewhat obscure problem with how the Flash Player matches accented characters against \S. Specifically, it appears to miscount accented characters when matching them with \S, which produces incorrect matches. This does not happen with the negated whitespace character set [^\s], even though the two should behave identically in RegEx. The issue is pretty odd, so I’ll give a few examples:

  1. The pattern /\S+/ (one or more non-whitespace characters) matches the full string “é aé”, when it should match “é” and “aé” separately.

  2. The same pattern /\S+/ correctly matches “aé” and “bé” in the string “aé bé”.

  3. The pattern /\S{2,}/ (two or more non-whitespace characters) matches the full string “aé bcé”, when it should match “aé” and “bcé”.

  4. The same pattern /\S{2,}/ matches only “bcé” in the string “éa bcé”, when it should match “éa” and “bcé”.

All of the above work properly if you substitute [^\s] for \S.
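The four cases above can be turned into a small comparison harness. This is a sketch in TypeScript, not ActionScript 3, so it shows what a conforming ECMAScript engine returns rather than reproducing the Flash Player bug. The names nonSpaceRuns, checks and report are hypothetical. In a conforming engine both spellings of the class agree on every row; per the examples above, the \S column is where the Flash Player diverges.

```typescript
// Collect every run matched by a global pattern (empty array when nothing matches).
function nonSpaceRuns(pattern: RegExp, text: string): string[] {
  return text.match(pattern) ?? [];
}

// One row per example in the list above: the \S form, the [^\s] form, and the input.
const checks: Array<{ shorthand: RegExp; negated: RegExp; text: string }> = [
  { shorthand: /\S+/g,    negated: /[^\s]+/g,    text: "\u00E9 a\u00E9" },  // "é aé"
  { shorthand: /\S+/g,    negated: /[^\s]+/g,    text: "a\u00E9 b\u00E9" }, // "aé bé"
  { shorthand: /\S{2,}/g, negated: /[^\s]{2,}/g, text: "a\u00E9 bc\u00E9" }, // "aé bcé"
  { shorthand: /\S{2,}/g, negated: /[^\s]{2,}/g, text: "\u00E9a bc\u00E9" }, // "éa bcé"
];

// Run both patterns on each input and record whether they agree.
const report = checks.map((c) => {
  const viaShorthand = nonSpaceRuns(c.shorthand, c.text);
  const viaNegated = nonSpaceRuns(c.negated, c.text);
  const agree = JSON.stringify(viaShorthand) === JSON.stringify(viaNegated);
  return { text: c.text, viaShorthand, viaNegated, agree };
});

for (const row of report) {
  console.log(row.text, row.viaShorthand, row.viaNegated, row.agree);
}
```

Keeping the two patterns side by side makes it easy to rerun the same rows in another engine and see at a glance which inputs disagree.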

Hopefully this is helpful for other people working with RegExp, especially with languages other than English. It is quite frustrating to work around – I ended up writing a specialized character lexer instead of using RegExp in SPL.

Know of any other RegExp bugs in AS3? Share them in the comments.

Grant Skinner

The "g" in gskinner. Also the "skinner".

@gskinner

12 Comments

  1. RegExp Bugs With Accented Characters

    Bookmarked your post over at Blog Bookmarker.com!

  2. A wild guess as to the problem with \S matching too many characters: it has a problem with some cases of multi-byte character runs, which wouldn’t be very surprising since regexps suck at non-ASCII on all systems I’ve used them in.

    The regexp engine in Firefox seems to handle all the \S+ cases (although it has the same basic problem of \w not matching accented characters).

  3. Theo,

    Yes, this was my thought too. It’s not counting the multi-byte character correctly in this case for some reason. Matching the trailing space is a little strange as well, but is likely related to the same problem. My guess would be the counting problem causes it to skip trying to match the space character completely.

  4. Hi Grant.

    Regarding accented letters: while this is a bit counterintuitive, it’s actually part of the ECMA-262 spec. The \w character class is just a shortcut for the a-z, A-Z, 0-9 ranges plus “_”, which does not include accented letters.

    Cheers

    Arthur Debert

    [1] The spec http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

    [2] The POSIX regex spec : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html

  5. Arthur,

    Right you are – my bad. I guess the problem then is that AS3’s RegExp implementation does not include support for any extended character classes (e.g. Unicode property sets), though I can understand that this may be due to file size implications in the player.

    I’ll update the article to reflect this.

  6. I pointed out this bug a year ago, but no one listened to me. I hope they listen to you now!

  7. I live in México, and I have known about this bug since the first Flex SDK came out. These days it is common practice to use more complicated RegExp patterns when working with Spanish text.

  8. I think I’ve found another regex bug:

    Any idea why

    /^(.*)-(.*)$/

    doesn’t find

    aaaa – bbbb

  9. Great list, it helps clear up much of the htacess mystery and confusion that comes from creating such files.

  10. Nikos – I just tested that pattern in RegExr, and it seems to work fine for me.

  11. String#replace accepts a function as a second argument. The function will have arguments for the match, and the index in the string where the match begins (and another for the entire string). But for unicode characters, the index is wrong (or at least, not what I would expect!)

    trace("_x_x".replace(/x/g, function(match, i, str) {
        trace(i)
        trace(str.charAt(i), str.charAt(i) === "x");
        return "_";
    }));

    trace("_™_™".replace(/™/g, function(match, i, str) {
        trace(i);
        trace(str.charAt(i), str.charAt(i) === "™");
        return "_";
    }));

  12. I’m not sure if it’s on the RegEx side or the TextField, but when getting the index position of a match it seems to be off if there are special characters such as em-dash or smart single quotes. As with the em-dash, it looks like it offsets the index position by 3 characters.
