One Problem Solved, New Problem Found

Not like I needed it, but the impending snow storm is again proving that winter stinks (to put it in family-friendly terms). I blame that oversized rat in Pennsylvania.

I’ve (temporarily) solved the timeout problem I’ve been having with importing TV schedule feeds using FeedAPI. Ideally, due to the time it takes to run them, I’d like to be able to schedule these feed imports separately from the rest of the Drupal cron jobs. The multi-threaded cron patch looked promising, but it was written for Drupal 5.1 and we’re currently building on 5.7. At this point I don’t want to get too deep into engineering some custom cron functionality. Instead, I’ll settle for a shortcut to allow my feeds to be imported nightly via the normal Drupal cron routine without timing out.

After digging a little deeper into the FeedAPI code I found that its implementation of hook_cron calls the function feedapi_cron_time before processing each row in a feed. That function basically checks that the total run time of the feed import hasn’t exceeded the allotted percentage of total cron run time. For now, I’ve decided to simply have that function always return true, so a timeout won’t happen.
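To make that check concrete, here is a minimal sketch of a time-budget test of the kind described above. It is written in Python for illustration and is not FeedAPI's actual PHP code; the names within_time_budget and import_rows, and the parameters cron_limit_seconds and allowed_fraction, are all hypothetical.

```python
import time


def within_time_budget(started_at, cron_limit_seconds, allowed_fraction, now=None):
    """Return True while the import has used less than its share of the cron run.

    All names here are hypothetical stand-ins, not FeedAPI's real API.
    """
    if now is None:
        now = time.time()
    elapsed = now - started_at
    return elapsed < cron_limit_seconds * allowed_fraction


def import_rows(rows, budget_check):
    """Process rows one at a time, stopping as soon as the budget check fails."""
    imported = []
    for row in rows:
        if not budget_check():
            break  # out of time: leave the remaining rows for a later run
        imported.append(row)
    return imported
```

Passing a check that always returns true, as in `import_rows(rows, lambda: True)`, mirrors the shortcut described above: every row gets processed no matter how long the import has been running.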

Problem solved - for now. Hopefully I can revisit this later and find a better solution; maybe multi-threaded cron will be more fully fleshed out and available by then.

Now that the entire feed import process is working I have a new problem.

I’d like to clean out the program and schedule data that I’ve accumulated in our development database and start importing data using the new code. This is actually quite a bit of data, as it includes data imported by an earlier version of the feed ingestor that we wrote last year. In fact, as of right now we have:

  • 539 programs
  • 4,926 episodes
  • 27,731 airings

This is far too much data to delete using the clunky delete content functionality (i.e. the node_delete function) in Drupal. That functionality is also pretty damned slow, as it has to churn through lots of PHP code and functions for each node to delete.

We’ve got a custom module for the TV schedules data that implements hook_nodeapi. If an episode gets deleted, this implementation makes sure all child airings are deleted. Likewise, if a program gets deleted, all child episodes (and hence all of their airings) get deleted. This ensures we should never have orphaned airings or episodes.
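The cascade can be pictured with a small sketch. This is Python over an in-memory dictionary, not our module's hook_nodeapi code; the delete_node function and the "parent" field are hypothetical stand-ins for however the real module links airings to episodes and episodes to programs.

```python
def delete_node(node_id, nodes, deleted):
    """Delete a node after recursively deleting every child that points at it.

    `nodes` maps a node id to a dict with a hypothetical "parent" key
    (None for a program). `deleted` collects ids in the order they are removed.
    """
    # Scans every remaining node to find this node's children.
    children = [nid for nid, node in nodes.items() if node["parent"] == node_id]
    for child in children:
        delete_node(child, nodes, deleted)
    del nodes[node_id]
    deleted.append(node_id)
```

Deleting a single program this way touches each of its episodes and airings one at a time, which hints at why the per-node approach gets slow as the data grows.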

Last night I tried to delete 10 programs using the standard node_delete functionality and it took at least 30 minutes.

Obviously, deleting all program and schedule data isn’t something we’ll be doing often. But I do need to do it at least once, I may need to do it again at some point during the build, and I don’t want to wait several days for it to get done. So I need a faster solution for deleting large amounts of data.

I did find this delete all module, which claimed to let you delete content by content type and to choose between the standard (and safest) node_delete option and a faster (and riskier) bulk delete using plain SQL statements. However, when I installed it I found that all it offers is the option to delete all content, and only by using the usual PHP functions.

Soooo, I’m about to dig in and see if I can write my own SQL statements to delete this content cleanly and quickly.
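As a rough idea of what a set-based delete might look like, here is a sketch using Python's sqlite3 module against a toy schema. The episode and airing tables and their columns are assumptions made up for this example, and the node and node_revisions tables are cut down to a couple of columns. A real Drupal database has other tables keyed by nid that would also need cleaning up, so treat this as an outline rather than something to run against a live site.

```python
import sqlite3

# Hypothetical content type names for the TV schedule data.
TV_TYPES = ("program", "episode", "airing")


def build_demo_db():
    """Create a tiny in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT);
        CREATE TABLE node_revisions (nid INTEGER, vid INTEGER);
        CREATE TABLE episode (nid INTEGER, program_nid INTEGER);
        CREATE TABLE airing (nid INTEGER, episode_nid INTEGER);
    """)
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(1, "program"), (2, "episode"), (3, "airing"), (4, "page")],
    )
    conn.executemany(
        "INSERT INTO node_revisions VALUES (?, ?)",
        [(1, 1), (2, 2), (3, 3), (4, 4)],
    )
    conn.execute("INSERT INTO episode VALUES (2, 1)")
    conn.execute("INSERT INTO airing VALUES (3, 2)")
    conn.commit()
    return conn


def bulk_delete(conn, types=TV_TYPES):
    """Delete all nodes of the given types with a handful of SQL statements."""
    marks = ",".join("?" for _ in types)
    in_types = f"SELECT nid FROM node WHERE type IN ({marks})"
    with conn:  # one transaction: either everything is deleted or nothing is
        # Clear the dependent tables first, while node still says which nids to remove.
        for table in ("airing", "episode", "node_revisions"):
            conn.execute(f"DELETE FROM {table} WHERE nid IN ({in_types})", types)
        cur = conn.execute(f"DELETE FROM node WHERE type IN ({marks})", types)
    return cur.rowcount  # number of rows removed from node
```

The point of the design is that the work scales with the number of tables rather than the number of nodes, at the cost of skipping whatever cleanup the PHP hooks would normally do.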

In between rounds of shoveling my driveway, of course…
