Parsing Tags and Text From HTML

Here’s some quick code to parse individual tags and text elements out of an HTML string. It might be handy for some people, but it’s also a good example of some of the more advanced RegExp features. Note that you could also do this by parsing the string as XML and traversing it with E4X, as long as the markup is well-formed.

First, let’s look at pulling out all of the tags:

var tags:Array = htmlText.match(/<[^<]+?>/g);

This code returns an array of all the substrings of htmlText that match a simple regular expression. The regular expression matches any text that:

< starts with <
[^<]+? followed by one or more (+) characters that are not < ([^<]). This is a lazy, or non-greedy, repeat (?), which means it will match as few characters as possible before matching the next element in the pattern.
> ends with >.

Note that in order to match multiple substrings the pattern must have the global flag set (g).
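To make that concrete, here’s a minimal sketch of the tag pattern run against a short sample string. The sample markup is made up for illustration; it is not the string used in the demo.

```actionscript
// Hypothetical sample input; any HTML string would do.
var htmlText:String = "<p>Parsing <b>HTML</b> and text</p>";

// Global match: every substring that starts with <, ends with >,
// and has no other < in between.
var tags:Array = htmlText.match(/<[^<]+?>/g);

// Trace each tag on its own line.
for each (var tag:String in tags) {
    trace(tag);
}
// Traces, one per line: <p>  <b>  </b>  </p>
```

Without the g flag the same call would stop after the first match, so you would only get <p> back.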

Next, let’s pull out the individual text elements:

var text:Array = input.htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);

This time the RegExp pattern is a bit more complex, incorporating a positive lookbehind and a positive lookahead. A lookaround lets you require that something appears before or after your main pattern without including it in the result.

(?<=^|>) start with a positive lookbehind to match (but not return) either the beginning of the string or the end of a tag (^|>).
[^><]+? followed by a lazy search for one or more characters that are neither > nor <.
(?=<|$) finish by using a lookahead to match (but not return) the beginning of the next tag, or the end of the string (<|$).
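As a rough sketch of how the three pieces work together, here is the text pattern applied to a made-up sample string. I’m assuming htmlText is a plain String here rather than a property of a text field, and the brackets in the trace output are only there to make leading and trailing spaces visible.

```actionscript
// Hypothetical sample input; any HTML string would do.
var htmlText:String = "<p>Parsing <b>HTML</b> and text</p>";

// The lookbehind and lookahead anchor each match between a > (or the start
// of the string) and a < (or the end of the string) without capturing them.
var text:Array = htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);

// Wrap each entry in brackets so whitespace is easy to see.
for each (var item:String in text) {
    trace("[" + item + "]");
}
// Traces, one per line: [Parsing ]  [HTML]  [ and text]
```

Note that the spaces next to the tags are kept, because the pattern returns everything between one tag and the next.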

Here’s a simple demo of the code in action:

Note: The seemingly empty entry in the text list is a single space that sits between the tags after “HTML” and the tags before “and”.

You can download the Flash CS3 FLA for the above example by clicking here.
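If whitespace-only entries like the one mentioned in the note get in your way, one option is to filter them out after matching. This is just a sketch of one possible approach, not part of the demo: the sample string and the hasVisibleText helper are names I made up for illustration.

```actionscript
// Hypothetical sample input with a lone space between two tags.
var htmlText:String = "<b>HTML</b> <i>and</i> text";

var text:Array = htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);
// text now holds: "HTML", " ", "and", " text"

// Hypothetical helper: keep an entry only if it contains
// at least one non-whitespace character.
function hasVisibleText(item:*, index:int, array:Array):Boolean {
    return /\S/.test(String(item));
}

var visible:Array = text.filter(hasVisibleText);
// visible now holds: "HTML", "and", " text"
```

The filter only drops entries that are entirely whitespace; it does not trim the spaces around the remaining entries.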

5 Comments

Todd Perkins March 13, 2008 at 2:31pm

Thanks! I’ve been looking for a quick way to do this. Nice job!

dogeroski March 14, 2008 at 4:55am

Great job! I was always doing this using while(htmlText.indexOf())..that was a real mess 🙂

Best regards.

Cedric M. (aka maddec) March 18, 2008 at 12:11pm

Thank you Grant! I was looking exactly for that few times ago, but hadn’t time to go further into Regex! Very useful, for example to get raw text from the Rich Text Editor of Flex…

I found this tool to simplify edition of regex codes, maybe do you already know about it:

Expresso by Ultrapico.

Best regards.

Daniel B. March 19, 2008 at 2:34am

Hey Grant! Thanks for the Code, helps me a lot to understand RegExp better. Greets, Daniel

Eric March 25, 2008 at 11:56am

Thanks.

I need a regex refresher. When I start to think of all the expressions my head hurts.

There is a tool called regexdesigner written in vb.net that is useful for building regex strings to parse or search for text that may be useful for some people.