Archive for the ‘cron’ Category

* One Problem Solved, New Problem Found

Posted on February 22nd, 2008 by Phil. Filed under Drupal, PHP, SQL, cron.


Not like I needed it, but the impending snow storm is again proving that winter stinks (to put it in family-friendly terms). I blame that oversized rat in Pennsylvania.

I’ve (temporarily) solved the timeout problem I’ve been having with importing TV schedule feeds using FeedAPI. Ideally, due to the time it takes to run them, I’d like to be able to schedule these feed imports separately from the rest of the Drupal cron jobs. The multi-threaded cron patch looked promising, but it was written for Drupal 5.1 and we’re currently building on 5.7. At this point I don’t want to get too deep into engineering some custom cron functionality. Instead, I’ll settle for a shortcut to allow my feeds to be imported nightly via the normal Drupal cron routine without timing out.

After digging a little deeper into the FeedAPI code I found that its implementation of hook_cron calls the function feedapi_cron_time before processing each row in a feed. That function basically checks that the total run time of the feed import hasn’t exceeded the allotted percentage of total cron run time. For now, I’ve decided to simply have that function always return true, so a timeout won’t happen.

Problem solved - for now. Hopefully I can revisit this later and find a better solution; maybe by then multi-threaded cron will be more fully fleshed out and available.

Now that the entire feed import process is working I have a new problem.

I’d like to clean out the program and schedule data that I’ve accumulated in our development database and start importing data using the new code. This is actually quite a bit of data, as it includes data imported by an earlier version of the feed ingestor that we wrote last year. In fact, as of right now we have:

  • 539 programs
  • 4,926 episodes
  • 27,731 airings

This is far too much data to delete using the clunky delete content functionality (i.e. the node_delete function) in Drupal. That functionality is also pretty damned slow, as it has to churn through lots of PHP code and functions for each node to delete.

We’ve got a custom module for the TV schedules data that implements hook_nodeapi. If an episode gets deleted, this implementation makes sure all child airings are deleted. Likewise, if a program gets deleted, all child episodes (and hence all of their airings) get deleted. This ensures we should never have orphaned airings or episodes.

Last night I tried to delete 10 programs using the standard node_delete functionality and it took at least 30 minutes.

Obviously, deleting all program and schedule data isn’t something we’ll be doing often. But I do need to do it at least once and I may need to do it again at some point during the build and I don’t want to wait several days for it to get done. So I need a faster solution for deleting large amounts of data.

I did find this delete all module, which claimed to let you delete content by content type and to choose either the standard (and safest) node_delete option or a faster (and riskier) bulk delete just using SQL statements. However, when I installed it I found that all it offers is the option to delete all content, and only by using the usual PHP functions.

Soooo, I’m about to dig in and see if I can write my own SQL statements to delete this content cleanly and quickly.

In between rounds of shoveling my driveway, of course…
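For illustration only, here is a rough sketch of what a children-first bulk delete in plain SQL could look like. It is written in Python against an in-memory SQLite database, not against Drupal. The three tables (program, episode, airing) and their columns are a hypothetical toy schema; a real Drupal database spreads each node across several tables, which this sketch ignores.

```python
import sqlite3


def make_db():
    """Build a toy schema (hypothetical, not Drupal's real tables)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE program (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE episode (id INTEGER PRIMARY KEY, program_id INTEGER);
        CREATE TABLE airing  (id INTEGER PRIMARY KEY, episode_id INTEGER);
    """)
    return db


def delete_programs(db, program_ids):
    """Delete the given programs with their episodes and airings.

    Children go first so that no orphaned airings or episodes remain.
    """
    program_ids = list(program_ids)
    if not program_ids:
        return
    marks = ",".join("?" * len(program_ids))
    with db:  # run all three statements in one transaction
        db.execute(
            "DELETE FROM airing WHERE episode_id IN "
            f"(SELECT id FROM episode WHERE program_id IN ({marks}))",
            program_ids,
        )
        db.execute(
            f"DELETE FROM episode WHERE program_id IN ({marks})", program_ids
        )
        db.execute(f"DELETE FROM program WHERE id IN ({marks})", program_ids)
```

Three set-based statements replace one PHP round trip per node, which is where the speed would come from. The trade-off is the risk mentioned above: anything other modules store about those nodes is left untouched unless it is deleted explicitly too.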




* Not So Fast, Mr. Smarty Pants

Posted on February 20th, 2008 by Phil. Filed under Drupal, PHP, Television, cron.


OK, that last post may have been premature.

I’m still having a timeout problem with the TV schedules import routine. Before I get into the latest issue, let’s take a step back and review what we’re doing here.

As I’ve said before, we’re importing TV program and schedule data from PBS via XML. These data are generated by TV Guide and provided to us as RSS feeds, one for each of our channels, with each daily file containing schedule data two weeks into the future.

In order to support the publication of a schedule grid, daily channel schedules and program/series/episode pages, we’re using CCK to create the following custom content types:

  • TV Channel
  • TV Program
  • TV Episode - Related to one program via node reference
  • TV Airing - Related to one episode and one channel via node references

We’re coding a routine to import the program and schedule XML data and create program/episode/airing nodes. The import is done using FeedAPI, the SimplePie parser and a custom feed processor of our own construction. We were originally thinking of using the FeedAPI Node module to process the data and create the nodes but didn’t because (1) we need to create or update up to three different nodes for each line of a feed (program/episode/airing nodes) and (2) the logic to match feed content to existing nodes is, due to the nature of the PBS data, quite messy.
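To make the "up to three nodes per feed line" point concrete, here is a minimal Python model of a create-or-update pass over one feed row. It is a sketch under assumptions, not the actual processor: the row field names (program_title, episode_title, channel, start) and the matching keys are hypothetical, and the real matching logic for the PBS data is messier than a simple key lookup.

```python
def new_store():
    """Three dicts standing in for the program, episode and airing nodes."""
    return {"program": {}, "episode": {}, "airing": {}}


def upsert(table, key, values):
    """Insert or overwrite table[key]; return True if the key was new."""
    is_new = key not in table
    table[key] = values
    return is_new


def process_row(store, row):
    """Create or update the program, episode and airing for one feed row.

    Returns the list of node types that were newly created.
    """
    # Hypothetical matching keys, chosen only for this illustration.
    program_key = row["program_title"]
    episode_key = (program_key, row["episode_title"])
    airing_key = (episode_key, row["channel"], row["start"])

    created = []
    if upsert(store["program"], program_key, {"title": row["program_title"]}):
        created.append("program")
    if upsert(store["episode"], episode_key,
              {"title": row["episode_title"], "program": program_key}):
        created.append("episode")
    if upsert(store["airing"], airing_key,
              {"episode": episode_key, "channel": row["channel"],
               "start": row["start"]}):
        created.append("airing")
    return created
```

A second airing of the same episode would create only a new airing entry here, while the program and episode entries are matched and updated in place.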

We’ve got the import, parsing and processing code working. However, as I mentioned in my previous post, we’re running into issues with process timeouts during the feed refresh. There’s quite a bit of processing going on for each line in a schedule feed. Given that and the fact that each feed has 400+ items and that we’re working with six schedule feeds (one for each of our channels), the nightly feed refresh process is not quick. At this point, no real application or database tuning has occurred, which is part of the problem, but I think that even after everything is tuned to the hilt for performance we’ll still have this issue.

I first ran into this last week when I tried to refresh a feed using the Refresh link on the Feed Administration page. As I said earlier, I realized that this process was restricted by PHP’s maximum execution time setting, which defaulted to 30 seconds. I was able to get around this by using the set_time_limit function to restart the timer as each row in the feed was processed. That worked great for refreshes initiated on the Feed Administration screen.

However, things broke down again when I tried to have the feeds refresh via cron. I found that drupal_cron_run, the function in includes/common.inc which executes cron jobs, sets the cron timeout to 240 seconds using set_time_limit. Since every call to set_time_limit restarts the timer, that limit should still be overridden by my code’s per-row calls to set_time_limit, so it isn’t the culprit.

After digging into the workings of FeedAPI I found that the module’s implementation of hook_cron checks after (or before, I forget which) each row of a feed is processed to make sure that the process is not using up more than the allotted percentage of the total cron run time, as specified on the FeedAPI administration screen. I’ve got that percentage set to 75%, which, against the 240-second cron limit, works out to 180 seconds. Not enough.
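The budget arithmetic and the per-row check can be modeled in a few lines of Python. This is a sketch of the idea only, not FeedAPI's code: the function names are made up, the check is placed before each row, and the strict less-than comparison is an assumption.

```python
import time

CRON_TIME_LIMIT = 240  # seconds, the limit set by drupal_cron_run
FEEDAPI_SHARE = 0.75   # the percentage chosen on the FeedAPI admin screen


def budget_seconds(limit=CRON_TIME_LIMIT, share=FEEDAPI_SHARE):
    """Seconds of the cron run that feed processing may use."""
    return limit * share


def refresh_feed(rows, process_row, clock=time.monotonic, budget=None):
    """Process rows until they run out or the time budget is spent.

    Returns the number of rows handled. `clock` is injectable so the
    behaviour can be checked without actually waiting.
    """
    if budget is None:
        budget = budget_seconds()
    started_at = clock()
    handled = 0
    for row in rows:
        # Assumed placement: check the budget before each row.
        if clock() - started_at >= budget:
            break
        process_row(row)
        handled += 1
    return handled
```

With a feed of 400+ items, anything left over when the budget runs out simply does not get processed in that cron run, which matches the symptom described here.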

The bottom line here is that we’ve got an issue with refreshing our feeds nightly via cron without being restricted by these timeout conditions. Ideally, I’d like to be able to schedule each feed refresh separately, outside of the global Drupal cron job. I’d also like them to run in a multi-threaded fashion.

It seems to me that the Drupal cron functionality needs a bit of work. It’d be nice, in general, to be able to schedule cron jobs for different modules at different times and at different intervals and even give them different timeout limits. The only thing I see out there which may help is this patch for multi-threaded cron jobs. I haven’t tried it yet, but will.

If you have a brilliant idea for how to crack this nut let me know by leaving a comment. Also, if you have any hot stock tips, leave those also.





  • Disclaimer

  • The opinions expressed in here are those of the writers/contributors and do not necessarily represent the views or opinions of the WGBH Educational Foundation.