Make a Folder Manifest for XML Files

One task that comes up a lot when I’m working with many XML files (mostly DITA content) is creating a list of all the XML files within a folder. More often than not, I want this list to be an XML file, too. There are really no folder-level (or even file-level) operations in XSLT to do this; it’s simply not what that language is used for. So I had to create a simple script. Scripts like this are very easy to integrate into the DITA-OT (though that’s not where I use this particular script).

If you’re a web developer, there are probably many better ways to go about doing this than using a Windows batch file. You probably already know many of them. This isn’t intended to be used in a web data scenario; it’s meant for local XML data management tasks.

The Windows Batch File

I personally really like the Windows batch file command language. It’s pretty simple, even though it does lack a lot of nice features [1]. When you want to do folder or file operations in Windows, I think it’s the easiest thing to use, even when you’re a really poor programmer like I am.

This batch file writes three pieces of information to an external XML file:

  1. It writes a root node to an XML file. It also adds the folder path into an attribute of the root node, which can be useful for post-processing.
  2. For every XML file in the folder, it adds a child node after the root node’s open tag. Each child node contains a link to the corresponding XML file in the folder.
  3. It writes a close tag for the root node.

I refer to this new XML file as a manifest, as it lists all of the contents (well, the XML files in this case, anyway) of the folder. Once the manifest exists, XSLT can be run against it to read or change the information in the files it lists.

So, MakeManifest.bat looks like this:

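@ECHO OFF
REM Name of the manifest file to write (change it here if you want a different name or extension).
SET "output=manifest.xml"
REM Write the root open tag; %~dp0 expands to the drive and path of this batch file.
>"%output%" ECHO ^<manifest sourcepath="%~dp0"^>
REM Add one child element for each XML file in the current folder.
FOR %%f IN (*.xml) DO (
    >>"%output%" ECHO     ^<file href="%%~nxf"/^>
)
REM Write the root close tag.
>>"%output%" ECHO ^</manifest^>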

Copy those lines into a plain text editor, save the file with the extension .bat, and give it a try! That’s all there is to it. If none of that makes any sense to you, I’ll refer you to SS64’s CMD reference page.

It is worth noting (and the sharp reader might have figured this out already) that this list will include a reference to the manifest itself, since it is another XML file in the folder. You could simply change the output file’s extension to something else (.txt, .manifest, etc.), which is one good reason I put the file name in a variable: it makes that easy to do. The extension doesn’t affect what’s in the file.
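
For comparison, here is a rough sketch of the same three steps in Python, with one addition: it skips the manifest file itself instead of relying on a different extension. This is not the script I use. The function name make_manifest and its output_name parameter are made up for illustration, and it assumes the manifest is written into the folder being listed.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def make_manifest(folder, output_name="manifest.xml"):
    """Write a manifest of the folder's .xml files, skipping the manifest itself.

    Hypothetical helper for illustration; returns the path of the file written.
    """
    folder = Path(folder).resolve()

    # Collect the XML file names, leaving out the manifest so it never lists itself.
    names = sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file()
        and p.suffix.lower() == ".xml"
        and p.name.lower() != output_name.lower()
    )

    # 1. Root open tag with the folder path in an attribute.
    lines = ["<manifest sourcepath=%s>" % quoteattr(str(folder))]
    # 2. One child element per XML file (quoteattr escapes &, < and quotes).
    lines += ["    <file href=%s/>" % quoteattr(name) for name in names]
    # 3. Root close tag.
    lines.append("</manifest>")

    out = folder / output_name
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling make_manifest(".") from the folder you want to list would write manifest.xml there, much as the batch file does.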

Post-Processing the Manifest File

In my case, these XML files tend to be DITA topics. What I’m really after here is a DITA map. With a little XSLT file to process this manifest (which can be run from the same Windows batch file), it’s easy to create a DITA map for all of the DITA topics the script finds in the folder.

Now, to do this, I use Saxon9HE, which is the open source version of Saxonica’s (Michael Kay’s) XSLT processor. It’s easy to use, very fast, supports the latest versions of everything, and is free.

I’ll follow up this post with another soon about how to do just that. I wanted to post this step first so as not to overwhelm anyone who is still learning (and so I don’t have an excuse to put off posting anything).
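
In the meantime, just to show the shape of that transformation, here is a minimal sketch in Python rather than XSLT. It is not the stylesheet I use with Saxon. The function name manifest_to_ditamap and its skip parameter are made up for illustration, and it assumes every file listed in the manifest really is a DITA topic.

```python
import xml.etree.ElementTree as ET


def manifest_to_ditamap(manifest_xml, skip=("manifest.xml",)):
    """Turn manifest XML text into the text of a bare-bones DITA map.

    Hypothetical helper: each <file href="..."/> becomes a <topicref href="..."/>.
    """
    manifest = ET.fromstring(manifest_xml)
    ditamap = ET.Element("map")

    for file_el in manifest.findall("file"):
        href = file_el.get("href")
        # Leave out entries without an href and the manifest's entry for itself.
        if not href or href in skip:
            continue
        ET.SubElement(ditamap, "topicref", {"href": href})

    # No DOCTYPE is written here; a real map would normally declare one.
    return ET.tostring(ditamap, encoding="unicode")
```

Feeding it the text of manifest.xml gives back a map element with one topicref per listed file, in the same order as the manifest.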

  1. The most notable omission, to me, is regular expressions. However, the RxFind utility is a great way to add regular expression search and replace functionality to your Windows batch files, and I use it a lot.

Regular Expressions versus XSLT

Last week I came across an epic rant within a forum thread [1] about why using regular expressions for parsing XML is a bad idea.

The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty.

At first, I was a little surprised. I love using regular expressions to make bulk changes throughout an XHTML document or even across a project consisting of hundreds of files. But after reading through the post several times and thinking about what I’ve been able to accomplish with some (relatively) simple XSLT files and an XML parser, it occurred to me that the rant is absolutely correct.

You see, as great as regular expressions are, they are not aware of context. They have no idea whether you’re matching a pattern within a C++ routine or an XHTML file. They can only match characters and short strings as they are, with no understanding of their meaning.

Extensible Stylesheet Language Transformations (XSLT), on the other hand, exist solely for the purpose of manipulating XML content. By definition, they are aware of XML elements and their attributes. Their entire purpose is high-level modification. In fact, after having used them now to successfully convert some XHTML to DITA XML, I have to say the power feels almost god-like.

Regular expressions still have their uses with XML, particularly with the badly formed SGML/HTML that one might have had dumped in one’s lap. But if the need is actually to manipulate XML elements or attributes within a file (or even across files), then it’s really foolish to try to accomplish with multiple regular expressions what a single XSL template will do (and often without the unintended consequences of a greedy RegEx).
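
To make the greedy RegEx problem concrete, here is a small sketch in Python. It uses Python’s ElementTree as a stand-in for an XSL template, simply because it is another tool that knows about elements and attributes. The task (renaming b elements to strong), the sample markup, and the function names are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def rename_with_regex(xhtml):
    # The greedy .* runs from the first <b> to the LAST </b> in the line,
    # and the pattern never sees <b class="x"> as a b element at all.
    return re.sub(r"<b>(.*)</b>", r"<strong>\1</strong>", xhtml)


def rename_with_parser(xhtml):
    # The parser knows what an element is, so every b is renamed,
    # attributes included, and the result stays well-formed.
    root = ET.fromstring(xhtml)
    for el in list(root.iter("b")):
        el.tag = "strong"
    return ET.tostring(root, encoding="unicode")


sample = '<p><b>one</b> and <b class="x">two</b></p>'

print(rename_with_regex(sample))
# <p><strong>one</b> and <b class="x">two</strong></p>

print(rename_with_parser(sample))
# <p><strong>one</strong> and <strong class="x">two</strong></p>
```

The regular expression version produces mismatched tags, so the output is no longer well-formed XML. A more careful pattern would fix this particular case, but it would take a new pattern for each new wrinkle, which is exactly the problem.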

  1. And when I say epic, I mean it goes from making a case as to why RegEx is simply not high-level enough to deal with HTML parsing to opening the gates of the abyss and letting the deep ones into your mind.