Make a Folder Manifest for XML Files

One task that comes up a lot when I’m working with many XML files (mostly DITA content) is creating a list of all the XML files within a folder. More often than not, I want this list to be an XML file, too. There are really no folder-level (or even file-level) operations in XSLT to do this; it’s simply not what that language is for. So I had to create a simple script. Scripts like this are very easy to integrate into the DITA-OT (though that’s not where I use this particular script).

If you’re a web developer, there are probably many better ways to go about doing this than using a Windows batch file. You probably already know many of them. This isn’t intended to be used in a web data scenario, but more for local XML data management tasks.

The Windows Batch File

I personally really like the Windows batch file command language. It’s pretty simple, even though it does lack a lot of nice features[1]. When you want to do folder or file operations in Windows, I think it’s the easiest thing to use, even when you’re a really poor programmer like I am.

This batch file writes three pieces of information to an external XML file:

  1. It writes a root node to an XML file. It also adds the folder path into an attribute of the root node, which can be useful for post-processing.
  2. For every XML file in the folder, it adds a child node after the root node’s open tag. Each child node contains a link to one of the XML files in the folder.
  3. It writes a close tag for the root node.

I refer to this new XML file as a manifest, as it lists all of the contents (well, the XML files in this case, anyway) of the folder. Once an XML file is created with this information, XSLT can then be run against this manifest file to use or change the information in the files it lists.

So, MakeManifest.bat looks like this:

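REM Name of the manifest file to create.
SET output=manifest.xml
REM Open the root node; %~dp0 expands to the folder this batch file lives in.
ECHO ^<manifest sourcepath="%~dp0"^> > %output%
REM Add one child node for every XML file in the current folder.
FOR %%f in ("*.xml") DO (
    ECHO      ^<file href="%%~nf.xml"/^> >> %output%
)
ECHO ^</manifest^> >> %output%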

Copy those lines into a plain text editor, save the file with the extension .bat, and give it a try! That’s all there is to it. If none of that makes any sense to you, I’ll refer you to SS64’s CMD reference page.

It is worth noting (and the sharp reader might have figured this out already) that this list will include a reference to the manifest itself, since it is another XML file in the folder. You could simply change the output file’s extension to something else (.txt, .manifest, etc.), which is a good reason I put the name in a variable to make that easy to do. It doesn’t affect what’s in the file.
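For anyone not on Windows, or who would rather have the manifest skip itself than rename it, the same three steps can be sketched in Python. This is only an illustration and not the batch file above; the function names build_manifest and write_manifest are made up for this example, and the sketch records the resolved folder path in sourcepath rather than the script’s own location.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def build_manifest(folder, output_name="manifest.xml"):
    """Return manifest XML text listing every .xml file in folder,
    leaving out the manifest file itself."""
    folder = Path(folder)
    # Step 1: root node open tag, with the folder path as an attribute.
    lines = [f"<manifest sourcepath={quoteattr(str(folder.resolve()))}>"]
    # Step 2: one child node per XML file found in the folder.
    for path in sorted(folder.glob("*.xml")):
        if path.name.lower() == output_name.lower():
            continue  # skip the manifest so it does not list itself
        lines.append(f"    <file href={quoteattr(path.name)}/>")
    # Step 3: root node close tag.
    lines.append("</manifest>")
    return "\n".join(lines) + "\n"


def write_manifest(folder, output_name="manifest.xml"):
    """Write the manifest into the folder and return its text."""
    text = build_manifest(folder, output_name)
    (Path(folder) / output_name).write_text(text, encoding="utf-8")
    return text
```

Calling write_manifest(".") from the folder that holds the topics produces the same kind of file as the batch script, minus the self-reference. quoteattr is used so that file names containing characters such as & still yield well-formed XML.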

Post-Processing the Manifest File

In my case, these XML files tend to be DITA topics. What I’m really after here is to create a DITA map. With a little XSLT file to process this manifest (which can be run from the same Windows batch file), it’s easy to create a DITA map for all of the DITA topics the script finds in the folder.

Now, to do this, I use Saxon9HE, which is the open source version of Saxonica’s (Michael Kay’s) XSLT processor. It’s easy to use, very fast, supports the latest versions of everything, and is free.

I’ll follow up this post with another soon about how to do just that. I wanted to post this step first so as to not overwhelm someone who is learning (nor give me an excuse to put off posting anything).
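Until then, here is a rough idea of what that post-processing step does, sketched in Python with the standard library’s ElementTree instead of XSLT and Saxon. It is a stand-in under stated assumptions, not the stylesheet I actually use: the function name manifest_to_ditamap and the default title are invented for this example, and it simply turns every file entry into a topicref without checking that the file really is a DITA topic.

```python
import xml.etree.ElementTree as ET

# Standard public identifier for a DITA map; ElementTree does not write
# DOCTYPE declarations, so it is added to the output by hand.
DOCTYPE = '<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">'


def manifest_to_ditamap(manifest_xml, title="Folder contents"):
    """Turn manifest XML text into DITA map text, one topicref per file."""
    manifest = ET.fromstring(manifest_xml)
    ditamap = ET.Element("map")
    ET.SubElement(ditamap, "title").text = title
    for entry in manifest.findall("file"):
        # Carry each href from the manifest over to a topicref.
        ET.SubElement(ditamap, "topicref", href=entry.get("href"))
    body = ET.tostring(ditamap, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{DOCTYPE}\n{body}\n'
```

Passing in the text of manifest.xml returns a map with one topicref for each file node, in the same order as the manifest.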

  1. The most notable omission, to me, is regular expressions. However, the RxFind utility is a great way to add regular expression search and replace functionality to your Windows batch files, and I use it a lot.

Regular Expressions versus XSLT

Last week I came across an epic rant within a forum thread[1] about why using regular expressions for parsing XML is a bad idea.

The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty.

At first, I was a little surprised. I love using regular expressions to make bulk changes throughout an XHTML document or even across a project consisting of hundreds of files. But, after reading through the post several times and thinking about what I’ve been able to accomplish with some (relatively) simple XSLT files and an XML parser, it occurred to me that it is absolutely correct.

You see, as great as regular expressions are, they are not aware of the context. They have no idea if you’re matching a pattern within a C++ routine or an XHTML file. They can only match characters and short strings as they are, with no understanding of their meaning.

Extensible Stylesheet Language Transformations (XSLT), on the other hand, exist solely for the purpose of manipulating XML content. By definition, they are aware of XML elements and their attributes. Their entire purpose is high-level modification. In fact, after having used them now to successfully convert some XHTML to DITA XML, I have to say the powers feel almost god-like.
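To make the difference concrete, here is a small Python sketch. It uses the standard library’s ElementTree as a stand-in for any XML-aware tool (it is not XSLT), and the sample markup and function names are invented for the illustration. The goal is to change class="note" to class="tip" on real elements only.

```python
import re
import xml.etree.ElementTree as ET

# One real attribute, plus the same characters inside escaped sample text.
SOURCE = (
    "<body>"
    '<p class="note">Real note.</p>'
    '<pre>Write &lt;p class="note"&gt; to mark up a note.</pre>'
    "</body>"
)


def rename_with_regex(text):
    """Blind text substitution: changes every occurrence of the string."""
    return re.sub(r'class="note"', 'class="tip"', text)


def rename_with_parser(text):
    """Structure-aware change: touches only actual class attributes."""
    root = ET.fromstring(text)
    for elem in root.iter():
        if elem.get("class") == "note":
            elem.set("class", "tip")
    return ET.tostring(root, encoding="unicode")
```

The regular expression rewrites both occurrences, including the one inside the pre element’s example text, because it only sees characters. The parser version changes the p element’s attribute and leaves the example text alone, because it knows which is which.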

Regular expressions still have their uses with XML, particularly with badly formed SGML/HTML that one might have had dumped in their lap. But if the need is actually manipulating XML elements or attributes within a file (or even across files), then it’s really foolish to try to accomplish something with multiple regular expressions when a single XSL template will do (and often without the unintended consequences of a greedy RegEx).
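As a quick illustration of that greedy trap, here is a minimal Python sketch. The sample sentence is made up, and ElementTree again stands in for an XML-aware approach such as an XSL template.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = "<p>Press <b>Save</b> and then <b>Close</b>.</p>"

# Greedy: .* runs from the first <b> all the way to the last </b>.
greedy = re.findall(r"<b>(.*)</b>", SOURCE)
# -> ['Save</b> and then <b>Close']

# Lazy: .*? stops at the nearest </b>, which works for this simple input.
lazy = re.findall(r"<b>(.*?)</b>", SOURCE)
# -> ['Save', 'Close']

# Parser: asks for the b elements directly, so there are no quantifiers to tune.
parsed = [b.text for b in ET.fromstring(SOURCE).iter("b")]
# -> ['Save', 'Close']
```

The lazy pattern fixes this particular case, but it is still matching characters; the parser version keeps working when attributes or nesting are added to the b elements.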

  1. And when I say epic, I mean it goes from making a case as to why RegEx is simply insufficiently high-level to deal with HTML parsing to opening the gates of the abyss and letting the deep ones into your mind.