Drush Body Mangler

This past week I launched a new site, and of course, it's built on Drupal. Yay!

Coming out of that project, I wanted to publish one particular story about a problem I solved along the way, to offer an explanation (and perhaps an apology) for the new d.o git sandbox project I ended up creating, currently titled "Drush Body Mangler".

No, it won't mangle any physical bodies (I hope), but it can most certainly mangle some content node body fields! Indeed, if you find yourself needing to wrangle, mangle, tangle or untangle the simple title/body fields of a Drupal 6 "node" object, and you have a hankering to do so in a scripted way, and with the help of Drush, then my hope is that this post and the code will prove a useful example for some other developers. Beyond that (as I'll get to), I hope it can be refined and improved to be even more useful to the community :)

So: I'm in the final stages of putting the site together, and I realize that there is a small but significant flaw in a bunch of the content. Many of the link tags lack a "title" attribute, which appears as a rollover in most browsers but, more importantly, helps folks using screen readers (and also the Googlebot, of course) identify where those links will lead them.

No problem, I thought: Drush can surely be of use here. I should be able to find a script (or at least an example) of somebody using Drush to search for a set of nodes, and then programmatically edit their contents, saving the results back to the site database. This is exactly the kind of thing Drush is built for!

UGLY HACK ALERT!

Several search queries (both via Google and directly on drupal.org) came up with nothing on this subject, and a thorough hunt through the thousands of Drupal projects was equally fruitless. I asked in a couple of the Drupal-related IRC channels, with no response. Finally, I sent a plea to the #drupal #lazyweb on the twitters.

In very short order, I got a hopeful response, which in turn led to a short discussion with @skwashd, which confirmed my suspicion that I should be able to whip up something useful. Thanks for your help Dave!

Indeed, within a couple of hours, I'd hacked together a simple script which essentially calls node_load() to load a node I want to edit, parses the body of the node searching for <a> tags with no title attribute, and adds one in. Finally, it calls node_save() to save the node back to the DB, and my changes are instantly visible on reloading the page in question.
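
For readers who want to picture the load/mangle/save loop, here is a minimal sketch. It is not the sandbox code itself: the function names are hypothetical, and using the visible link text as the title is an assumption made purely for illustration. Only node_load() and node_save() are the Drupal 6 calls mentioned above.

```php
<?php
// All function names here are hypothetical; none come from the sandbox project.

// preg_replace_callback() callback: adds a title to a single <a> element.
function mangle_link_title_callback($m) {
  list($whole, $open, $text) = $m;
  // Leave the link alone if it already has a title attribute.
  // Conservative: "title=" inside an href query string also counts as a hit.
  if (preg_match('/(?<![\w-])title\s*=/i', $open)) {
    return $whole;
  }
  // Assumption: the visible link text is an acceptable title.
  $title = trim(strip_tags($text));
  if ($title === '') {
    return $whole;
  }
  // Escape quotes without double-encoding existing entities.
  $title = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', FALSE);
  // Splice the attribute in right after the existing attributes.
  return $open . ' title="' . $title . '"' . substr($whole, strlen($open));
}

// Adds a title attribute to every <a>...</a> pair that lacks one.
function mangle_add_link_titles($html) {
  return preg_replace_callback('~(<a\b[^>]*)>(.*?)</a\s*>~is', 'mangle_link_title_callback', $html);
}

// Drupal 6 only: load a node, mangle its body, save it back.
// Returns TRUE if the node was changed and saved, FALSE otherwise.
function mangle_node_body($nid) {
  $node = node_load($nid);
  if (!$node) {
    return FALSE;
  }
  $body = mangle_add_link_titles($node->body);
  if ($body === $node->body) {
    return FALSE;
  }
  // Note: the separately stored teaser is left untouched in this sketch.
  $node->body = $body;
  node_save($node);
  return TRUE;
}
```

The first two functions are plain PHP and can be tried on any string. mangle_node_body() needs a bootstrapped Drupal 6 site, for example a script run through Drush's php-script command. Regex-based HTML editing like this is fragile, so try it on a copy of the database first.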

An hour later, I had some reasonably robust regex parsing that would mangle the particular nodes I was looking for in (almost) just the right way. It's not pretty, but it works. By way of apology, I also took the time to comment the code extensively, so as to illuminate the dark corners which loom between nearly every line. Code reviews (with git patches) are heartily welcome! :)

Towards making this generally useful (CCK and other complications)

Why so down on my own code? Well, because this was a total last-minute saving throw of a script, I have a responsibility to acknowledge the lazy/incomplete way in which I solved the problem. I have only done the bare minimum for what I needed, but it is nonetheless a starting point, and an instructive example in at least two ways I can think of. First, the code can be improved and modified in a variety of ways, and no doubt will need to be hacked on at least a little bit to be useful to anyone but myself at present. Beyond that, I wonder if my experience here might point to some potential improvements in Drupal core, which could have made this problem even easier to solve.

In our twitter exchange, Dave Hall rightly pointed out that in a general case, my approach to "editing" Drupal nodes programmatically (in particular calling node_save() to store the changed node) wouldn't likely work well, because of things like CCK fields and other modules affecting the rendering of the "body" on-the-fly. In my case I was happy to ignore this complication, since I knew the content I wanted to mangle existed only in the core "body" field so that a simple node_load/node_save would suffice.

The safer method Dave suggested was to use drupal_execute() to essentially submit the node/x/edit form through the script, rather than make a direct node_save() call. This should allow CCK and other modules to do their stuff exactly as if the form had been manually submitted. I say "should" only because I've previously had some difficulty with getting drupal_execute to work as advertised. Perhaps I've just missed some trick that makes it easy?
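
A rough sketch of that form-submission approach, modelled on the node-creation example in the Drupal 6 documentation for drupal_execute(). The helper name is hypothetical, and whether passing a loaded node this way keeps CCK field values intact on a given site is an assumption, not something verified here.

```php
<?php
// Hypothetical helper: saves a new body by submitting the node edit form.
// Returns TRUE on success, FALSE on validation errors or outside Drupal.
function mangle_save_via_form($nid, $new_body) {
  // Outside a bootstrapped Drupal 6 site there is no form to submit.
  if (!function_exists('drupal_execute')) {
    return FALSE;
  }
  // The node form builder lives in node.pages.inc, which may not be loaded yet.
  module_load_include('inc', 'node', 'node.pages');
  $node = node_load($nid);
  if (!$node) {
    return FALSE;
  }
  $form_state = array();
  $form_state['values']['title'] = $node->title;
  $form_state['values']['body'] = $new_body;
  $form_state['values']['name'] = $node->name;
  $form_state['values']['op'] = t('Save');
  // Assumption: other fields (CCK and friends) may need their current
  // values copied into $form_state['values'] as well.
  drupal_execute($node->type . '_node_form', $form_state, $node);
  // Validation failures show up here rather than as a return value.
  $errors = form_get_errors();
  return empty($errors);
}
```

Checking form_get_errors() afterwards matters, because a failed validation does not otherwise announce itself to the calling script.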

In addition to better "CCK-aware" node field editing support, several obvious improvements come to mind:

  • Configurable search/replace strings
  • Support common scenarios like missing attributes, unclosed tags, etc.
  • The ability to run in batches via cron to regularly scan/clean a site's content
  • Make it "pluggable" so the core load/save functions are separate from "editing" code
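
To make the "configurable" and "pluggable" ideas a bit more concrete, here is one possible shape, in plain PHP with no Drupal dependency. Every name in it is hypothetical: each editing step is a callback plus optional arguments, and the runner knows nothing about loading or saving nodes.

```php
<?php
// Hypothetical editing step: a configurable search/replace.
function mangle_replace($text, $search, $replace) {
  return str_replace($search, $replace, $text);
}

// Hypothetical runner: feeds the text through each step in order.
// A step is array('callback' => callable, 'args' => extra arguments).
function mangle_apply($text, $steps) {
  foreach ($steps as $step) {
    $args = isset($step['args']) ? $step['args'] : array();
    // The text being edited is always the first argument.
    array_unshift($args, $text);
    $text = call_user_func_array($step['callback'], $args);
  }
  return $text;
}

// Example configuration: normalise <br> tags, then trim whitespace.
$steps = array(
  array('callback' => 'mangle_replace', 'args' => array('<br>', '<br />')),
  array('callback' => 'trim'),
);
```

With this split, the load/save side could call mangle_apply($node->body, $steps) and stay the same no matter which edits are configured.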

Anyone interested in building some of these pieces?

Shouldn't this have been (even) easier in the first place?

The second point that strikes me as instructive is that I can't help but feel like there should have already been a way, or at least an API (we like those in Drupal now, right? ;) that would make this exercise trivial.

Why do I apparently have to choose between low-level functions that don't "get everything" when I have a site built with the more complex core node mechanisms and CCK fields, and a robot-form-submitter that only kinda works some of the time because FormAPI is notoriously finicky?

I'm really not complaining here, but honestly I was surprised. What I do think should come out of this experience is either a) someone showing me the way that Drupal does (or will) provide a good, clear interface for this, or b) we have some good conversations toward that goal. Personally, I'm in love with Drush and would thoroughly enjoy being able to readily edit the content of my Drupal site in vim as easily as I currently edit module/theme files for the same site ;)

Does D7 make this easier? If not, I think D8 ought to.

Unfortunately, I haven't really been able to digest all the new shiny-ness of D7 yet, so I know that I really don't know jack about how this may or may not work better now. Here again: feel free to school me in the comments! If there is a clean and clear way to do this in D7 I'd love to know about it, and this will of course fuel my motivation to get up to speed with it sooner rather than later.

If not, let's make it so. In fact, I'm inclined to say: let's bring all of Drush into core! But then I'm probably getting *really* carried away, right?