Regular Expressions versus XSLT

Last week I came across an epic rant with­in a forum thread1 about why using reg­u­lar expres­sions for pars­ing XML is a bad idea.

The <cen­ter> can­not hold it is too late. The force of regex and HTML togeth­er in the same con­cep­tu­al space will destroy your mind like so much watery putty.

At first, I was a lit­tle sur­prised. I love using reg­u­lar expres­sions to make bulk changes through­out an XHTML doc­u­ment or even across a project con­sist­ing of hun­dreds of files. But, after read­ing through the post sev­er­al times and thinkng about what I’ve been able to accom­plish with some (rel­a­tive­ly) sim­ple XSLT files and a XML pars­er, it occurred to me that it is absolute­ly correct.

You, see as great as reg­u­lar expres­sions are, they are not aware of the con­text. They have no idea if your match­ing a pat­tern with­in a C++ rou­tine or an XHTML file. They can only parse char­ac­ters and short strings as they are, with no under­stand­ing of their meaning.

EXsten­si­ble Stylesheet Lan­guage Trans­forms, on the oth­er hand, are sole­ly for the pur­pose of manip­u­lat­ing XML con­tent. By def­i­n­i­tion, they are aware of XML ele­ments and their attrib­ut­es. The entire pur­pose of them is high-lev­el mod­i­fi­ca­tions. In fact, after hav­ing used them now to suc­cess­ful­ly con­vert some XHTML to DITA XML, I have to say the pow­ers feel almost god-like.

RegEx still have their use with XML—particularly with bad­ly formed SGML/HTML one might have had dumped in their lap. But if the need is actu­al­ly manip­u­lat­ing XML ele­ments or attrib­ut­es with­in a file (or even across files), then it’s real­ly fool­ish to try to accom­plish some­thing with mul­ti­ple reg­u­lar expres­sions when a sin­gle XSL tem­plate will do (and often with­out the unin­tend­ed con­se­quences of a greedy RegEx).

  1. And when I say epic, I mean it goes from mak­ing a case as to why RegEx is sim­ply insuf­fi­cient­ly high-lev­el enough to deal with HTML pars­ing to open­ing the gates of the abyss and let­ting the deep ones in to your mind. []

By Jason Coleman

Structural engineer and technical content manager Bentley Systems by day. Geeky father and husband all the rest of time.

Leave a comment

Your email address will not be published. Required fields are marked *