Here’s some quick code to parse individual tags and text elements out of an HTML string. It might be handy for some people, but it’s also a good example of some more advanced RegExp features. Note that you could also do this by parsing the string as XML and traversing it with E4X, provided the markup is well-formed.
First, let’s look at pulling out all of the tags:
var tags:Array = htmlText.match(/<[^<]+?>/g);
This code simply returns an array of substrings from the htmlText that match a simple regular expression. The regular expression matches any text that:
- < starts with <
- [^<]+? followed by one or more (+) characters that are not < ([^<]). This is a lazy, or non-greedy, repeat (?), which means it will match as few characters as possible before matching the next element in the pattern.
- > ends with >.
Note that in order to match multiple substrings the pattern must have the global flag set (g).
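To see the pattern and the global flag at work, here is a minimal sketch run against a made-up string. The sample value and the variable names are assumptions for illustration only; they are not taken from the demo.

```actionscript
// Hypothetical sample input; any HTML-like string will do.
var sample:String = "<p>Some <b>HTML</b> text</p>";

// With the global flag, match() returns every tag, in order.
var allTags:Array = sample.match(/<[^<]+?>/g);
trace(allTags.join(" | ")); // <p> | <b> | </b> | </p>

// Without the global flag, only the first match comes back.
var firstOnly:Array = sample.match(/<[^<]+?>/);
trace(firstOnly[0]); // <p>
```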
Next, let’s pull out the individual text elements:
var text:Array = input.htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);
This time the RegEx pattern is a bit more complex, incorporating a positive lookbehind and a positive lookahead, collectively known as lookarounds. A lookaround lets you require that something appears before or after your main pattern without including it in the result.
- (?<=^|>) start with a positive lookbehind to match (but not return) either the beginning of the string or the end of a tag (^|>).
- [^><]+? followed by a lazy search for one or more characters that are neither > nor <.
- (?=<|$) finish by using a lookahead to match (but not return) the beginning of the next tag, or the end of the string (<|$).
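The three parts above can be tried out on a small made-up string. This is only a sketch: the sample value and variable names are assumptions, and the result is joined with | so that leading and trailing spaces stay visible.

```actionscript
// Hypothetical sample input with text before, inside, and after a tag pair.
var sample:String = "Some <b>HTML</b> text";

// The lookbehind and lookahead anchor each match to tag boundaries
// (or the ends of the string) without including them in the result.
var parts:Array = sample.match(/(?<=^|>)[^><]+?(?=<|$)/g);
trace(parts.length);    // 3
trace(parts.join("|")); // Some |HTML| text
```

Note that the surrounding whitespace is kept as part of each text element, since the pattern only excludes the < and > characters.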
Here’s a simple demo of the code in action:
Note: The seemingly empty entry in the text list is the single space that sits between the tags after “HTML” and the tags before “and”.
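If whitespace-only entries like that one are not wanted, one possible approach is to filter them out after matching. The sketch below assumes a made-up input string and uses Array.filter() with a simple \S test; it is one way to do it, not part of the original example.

```actionscript
// Hypothetical input in which two tag groups are separated only by a space.
var sample:String = "<b>HTML</b> <i>and</i>";
var pieces:Array = sample.match(/(?<=^|>)[^><]+?(?=<|$)/g);
// pieces now holds "HTML", " " and "and".

// Keep only the entries that contain at least one non-whitespace character.
var nonBlank:Array = pieces.filter(
    function(item:*, index:int, array:Array):Boolean {
        return /\S/.test(String(item));
    }
);
trace(nonBlank.join("|")); // HTML|and
```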
You can download the Flash CS3 FLA for the above example by clicking here.