Last week I came across an epic rant within a forum thread[1] about why using regular expressions for parsing XML is a bad idea.
The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty.
At first, I was a little surprised. I love using regular expressions to make bulk changes throughout an XHTML document or even across a project consisting of hundreds of files. But after reading through the post several times and thinking about what I’ve been able to accomplish with some (relatively) simple XSLT files and an XML parser, it occurred to me that the rant is absolutely correct.
You see, as great as regular expressions are, they are not aware of context. They have no idea whether you’re matching a pattern within a C++ routine or an XHTML file. They can only match characters and short strings as they are, with no understanding of their meaning.
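To make that concrete, here is a minimal sketch using only Python's standard library. The XHTML fragment and the `note`/`tip` class names are made up for illustration. The regular expression rewrites every occurrence of the character sequence, including the one inside the paragraph's text, while the parser touches only the attribute.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment: the attribute we want to change also
# appears, character for character, inside the paragraph's text.
doc = '<body><p class="note">Use class="note" for asides.</p></body>'

# The regex sees only characters, so it rewrites both occurrences.
regex_result = re.sub(r'class="note"', 'class="tip"', doc)

# The parser sees an element with an attribute, so only that changes.
root = ET.fromstring(doc)
for p in root.iter("p"):
    if p.get("class") == "note":
        p.set("class", "tip")
parser_result = ET.tostring(root, encoding="unicode")

print(regex_result)
print(parser_result)
```

The regex version quietly changes the prose as well as the markup, which is exactly the kind of thing that is easy to miss across hundreds of files.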
Extensible Stylesheet Language Transformations (XSLT), on the other hand, exist solely to manipulate XML content. By definition, they are aware of XML elements and their attributes. Their entire purpose is high-level modification. In fact, after having now used them to successfully convert some XHTML to DITA XML, I have to say the power feels almost god-like.
Regular expressions still have their uses with XML, particularly with the badly formed SGML/HTML one might have had dumped in one’s lap. But if the need is actually to manipulate XML elements or attributes within a file (or even across files), then it’s really foolish to try to accomplish with multiple regular expressions what a single XSL template will do (and often without the unintended consequences of a greedy RegEx).
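The greedy problem is easy to reproduce. The sketch below is an assumption-laden stand-in, not an XSL template: it uses Python's standard-library XML parser to show the same element-aware idea, and the `<strong>` to `<b>` mapping is only a hypothetical XHTML-to-DITA-style rename.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input with two separate <strong> elements.
xhtml = "<p><strong>first</strong> and <strong>second</strong></p>"

# Greedy .* runs from the first <strong> to the LAST </strong>,
# so the two inner tags are left behind and the result is mismatched.
greedy_result = re.sub(r"<strong>(.*)</strong>", r"<b>\1</b>", xhtml)

# Element-aware rename: each <strong> element is handled on its own.
root = ET.fromstring(xhtml)
for el in list(root.iter("strong")):
    el.tag = "b"
tree_result = ET.tostring(root, encoding="unicode")

print(greedy_result)
print(tree_result)
```

A non-greedy `.*?` would rescue this particular case, but it would still know nothing about nesting or attributes, whereas the tree-based rename (like a template matching `strong`) works on whole elements by construction.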
[1] And when I say epic, I mean it goes from making a case as to why RegEx is simply not high-level enough to deal with HTML parsing to opening the gates of the abyss and letting the deep ones into your mind.