Archive for February, 2008

* Views and Drupalcon Plans

Posted on February 29th, 2008 by Phil. Filed under Drupalcon, Views.


I’ve been up to my eyes, ears, nose and throat building out TV schedule related views. Things are moving along, I’ve got a nice working version of our Programs A-Z list and the daily schedules by channel. None of this involves any real theming work just yet; I’m focusing on functionality first. I’ll give more details on these pages next week.

Of more immediate concern, Drupalcon Boston is coming up next week and Pete and I will be attending, as will several others from WGBH. We’re looking forward to learning lots of Drupalish stuff, meeting folks and, of course, drinking free coffee. I mean, there is going to be free coffee, right?

Anyhow, there are lots of excellent sessions planned, across several different tracks. The full session list is here.

I’ll be spending most of my time at the Site Building and Design & User Experience tracks. Here are some of the sessions that I’m planning to attend. If you see me there, be sure to say hello.

I hope to see you there! I also hope there are free donuts.




* Deleting is Fun!

Posted on February 24th, 2008 by Phil. Filed under Devel, MySQL, PHP, SQL, Tools.


I have come up with a solution for deleting large numbers of nodes that bypasses the normal node_delete PHP function, which can be quite slow. Below is a description of what I did, which is not necessarily for the faint of heart. Use this method at your own risk…

The Problem

As I described in our last post, I’ve now got a nice working version of our TV programs and schedules ingestor. In the course of development of this code we’ve previously ingested lots of TV program/episode/airing records creating thousands of nodes. Now that that code is in a somewhat complete and clean state, I wanted to delete all of the old content and start fresh, so we can proceed with the front end development.

The problem was I had well over 30,000 nodes, representing three different content types, that needed to be deleted. The usual method of deleting content in Drupal - the node_delete function - proved to be about as slow as my morning commute into Boston, and so was all but unusable in this situation.

What I needed was a way to delete all nodes of a given content type using SQL statements, rather than the standard Drupal/PHP functionality.

The Solution

Deleting records from the node table by content type is obviously pretty easy. The tricky part is that there can be all sorts of related database records in any number of tables generated by various modules that also need to be deleted. So, I needed an easy way to identify all database records for a node of a given content type.

The solution was to use the Devel module to identify everything that needs to be deleted. First, with this module installed and enabled I turned on the Collect query info and Display query log options. With these in place, each page called will display a list of the queries executed (along with number of executions and execution times) at the bottom.

Next, I created a page in Drupal that would delete a single record of the content type in question. The body of the page was a chunk of PHP code to use node_delete to delete a record, like so:

<?php
// Fetch one node of the given content type and delete it through the normal Drupal API.
$res = db_query("SELECT n.nid FROM {node} n WHERE n.type = 'CONTENT_TYPE' LIMIT 1");
while ($n = db_fetch_object($res)) {
  node_delete($n->nid);
}
?>

In the above, CONTENT_TYPE gets replaced with the actual content type in question. You also need to set the input format to PHP code, so it’ll get executed, rather than just displayed.

Then, view this page in the browser. This will cause one node to be deleted and, thanks to the Devel module, all of the SQL statements executed during page generation to be listed at the bottom of the page. Among the list of SELECT statements executed by the page, you’ll also see a bunch of DELETE statements, for example:

DELETE FROM node_revisions WHERE nid = 1234;

Once I gathered up all of these delete statements, I then rewrote each one to delete not just one record but all records related to a specific content type. The above DELETE statement got rewritten as:

DELETE t.* FROM node_revisions t JOIN node n on (t.nid = n.nid) WHERE n.type = 'CONTENT_TYPE';
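
If there are a lot of captured statements, the rewrite itself can be scripted. What follows is only a minimal sketch, not what I actually ran: it assumes every captured statement has the simple form DELETE FROM <table> WHERE nid = <number>, and the function name rewrite_delete_for_type is made up for illustration. Anything that doesn't match that form is flagged for a manual rewrite.

```php
<?php
/**
 * Turn a captured single-node DELETE into a bulk DELETE for one content type.
 * Hypothetical helper; it only understands "DELETE FROM <table> WHERE nid = <number>".
 */
function rewrite_delete_for_type($statement, $content_type) {
  $pattern = '/^\s*DELETE\s+FROM\s+(\w+)\s+WHERE\s+nid\s*=\s*\d+\s*;?\s*$/i';
  if (!preg_match($pattern, $statement, $matches)) {
    // Different key column or a more complex WHERE clause: rewrite by hand.
    return NULL;
  }
  $table = $matches[1];
  // Double any single quotes so the type name stays a valid SQL string literal.
  $type = str_replace("'", "''", $content_type);
  if (strtolower($table) === 'node') {
    // The node table itself needs no join.
    return "DELETE FROM node WHERE type = '$type';";
  }
  return "DELETE t.* FROM $table t JOIN node n ON (t.nid = n.nid) WHERE n.type = '$type';";
}

$captured = array(
  'DELETE FROM node_revisions WHERE nid = 1234;',
  'DELETE FROM node WHERE nid = 1234;',
);
foreach ($captured as $statement) {
  $bulk = rewrite_delete_for_type($statement, 'CONTENT_TYPE');
  echo ($bulk === NULL ? "-- rewrite by hand: $statement" : $bulk), "\n";
}
```

One thing to watch with this approach: every joined statement depends on the rows in the node table still being there, so the DELETE against node itself has to go last in the file.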

I took all of these rewritten DELETE statements, put them into a file and executed the file using the MySQL command line tool, via the command:

mysql> source delete_all_airings.sql;

Voila! I repeated this for each content type that I wanted to clean out. All told, using this method I was able to delete all those nodes in under an hour.

Obviously, I don’t plan on doing this very often. But it sure was fun!




* One Problem Solved, New Problem Found

Posted on February 22nd, 2008 by Phil. Filed under Drupal, PHP, SQL, cron.


Not like I needed it, but the impending snow storm is again proving that winter stinks (to put it in family-friendly terms). I blame that oversized rat in Pennsylvania.

I’ve (temporarily) solved the timeout problem I’ve been having with importing TV schedule feeds using FeedAPI. Ideally, due to the time it takes to run them, I’d like to be able to schedule these feed imports separately from the rest of the Drupal cron jobs. The multi-threaded cron patch looked promising, but it was written for Drupal 5.1 and we’re currently building on 5.7. At this point I don’t want to get too deep into engineering some custom cron functionality. Instead, I’ll settle for a shortcut to allow my feeds to be imported nightly via the normal Drupal cron routine without timing out.

After digging a little deeper into the FeedAPI code I found that its implementation of hook_cron calls the function feedapi_cron_time before processing each row in a feed. That function basically checks that the total run time of the feed import hasn’t exceeded the allotted percentage of total cron run time. For now, I’ve decided to simply have that function always return true, so a timeout won’t happen.

Problem solved - for now. Hopefully I can revisit this later and find a better solution, like maybe multi-threaded cron will be more fully fleshed out and available.

Now that the entire feed import process is working I have a new problem.

I’d like to clean out the program and schedule data that I’ve accumulated in our development database and start importing data using the new code. This is actually quite a bit of data, as it includes data imported by an earlier version of the feed ingestor that we wrote last year. In fact, as of right now we have:

  • 539 programs
  • 4,926 episodes
  • 27,731 airings

This is far too much data to delete using the clunky delete content functionality (i.e. the node_delete function) in Drupal. That functionality is also pretty damned slow, as it has to churn through lots of PHP code and functions for each node to delete.

We’ve got a custom module for the TV schedules data that implements hook_nodeapi. If an episode gets deleted, this implementation makes sure all child airings are deleted. Likewise, if a program gets deleted, all child episodes (and hence all of their airings) get deleted. This ensures we should never have orphaned airings or episodes.
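
To make the cascade concrete, here is a tiny in-memory sketch of the delete order. It is not our module code (that lives in hook_nodeapi and works with node references); the array layout and the function name delete_with_children are invented just to show how one program delete fans out into episode and airing deletes.

```php
<?php
// In-memory stand-in for the nodes: nid => type and parent nid (illustrative data only).
$nodes = array(
  1 => array('type' => 'program', 'parent' => NULL),
  2 => array('type' => 'episode', 'parent' => 1),
  3 => array('type' => 'airing', 'parent' => 2),
  4 => array('type' => 'airing', 'parent' => 2),
  5 => array('type' => 'program', 'parent' => NULL),
);

/**
 * Delete a node and, first, everything that hangs off it.
 * Returns the total number of nodes removed.
 */
function delete_with_children(array &$nodes, $nid) {
  // Collect the children before deleting anything, so the array is not
  // modified while it is being scanned.
  $children = array();
  foreach ($nodes as $child_nid => $node) {
    if ($node['parent'] === $nid) {
      $children[] = $child_nid;
    }
  }
  $deleted = 0;
  foreach ($children as $child_nid) {
    $deleted += delete_with_children($nodes, $child_nid);
  }
  unset($nodes[$nid]);
  return $deleted + 1;
}

// Deleting program 1 also removes its episode and both airings.
echo delete_with_children($nodes, 1), "\n"; // prints 4
```

In the real thing each of those removals is a full node_delete call, which is why deleting a handful of programs turns into a lot of work.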

Last night I tried to delete 10 programs using the standard node_delete functionality and it took at least 30 minutes.

Obviously, deleting all program and schedule data isn’t something we’ll be doing often. But I do need to do it at least once and I may need to do it again at some point during the build and I don’t want to wait several days for it to get done. So I need a faster solution for deleting large amounts of data.

I did find this delete all module, which claimed to let you delete content by content type and to choose either the standard (and safest) node_delete option or a faster (and riskier) bulk delete just using SQL statements. However, when I installed it I found all it offers is the option to delete all content, and only using the usual PHP functions.

Soooo, I’m about to dig in and see if I can write my own SQL statements to delete this content cleanly and quickly.

In between rounds of shoveling my driveway, of course…




* Not So Fast, Mr. Smarty Pants

Posted on February 20th, 2008 by Phil. Filed under Drupal, PHP, Television, cron.


OK, that last post may have been premature.

I’m still having a timeout problem with the TV schedules import routine. Before I get into the latest issue, let’s take a step back and review what we’re doing here.

As I’ve said before, we’re importing TV program and schedule data from PBS via XML. These data are generated by TV Guide and provided to us as RSS feeds, one for each of our channels, with each daily file containing schedule data two weeks into the future.

In order to support the publication of a schedule grid, daily channel schedules and program/series/episode pages, we’re using CCK to create the following custom content types:

  • TV Channel
  • TV Program
  • TV Episode - Related to one program via node reference
  • TV Airing - Related to one episode and one channel via node references

We’re coding a routine to import the program and schedule XML data and create program/episode/airing nodes. The import is done using FeedAPI, the SimplePie parser and a custom feed processor of our own construction. We were originally thinking of using the FeedAPI Node module to process the data and create the nodes but didn’t because (1) we need to create or update up to three different nodes for each line of a feed (program/episode/airing nodes) and (2) the logic to match feed content to existing nodes is, due to the nature of the PBS data, quite messy.

We’ve got the import, parsing and processing code working. However, as I mentioned in my previous post, we’re running into issues with process timeouts during the feed refresh. There’s quite a bit of processing going on for each line in a schedule feed. Given that and the fact that each feed has 400+ items and that we’re working with six schedule feeds (one for each of our channels), the nightly feed refresh process is not quick. At this point, no real application or database tuning has occurred, which is part of the problem, but I think that even after everything is tuned to the hilt for performance we’ll still have this issue.

I first ran into this last week when I tried to refresh a feed using the Refresh link on the Feed Administration page. As I said earlier, I realized that this process was restricted by the maximum execution time PHP variable, which defaulted to 30 seconds. I was able to get around this by using the set_time_limit function to restart the timer as each row in the feed was processed. That worked great for refreshes initiated on the Feed Administration screen.

However, things broke down again when I tried to have the feeds refresh via cron. I found that drupal_cron_run, the function in includes/common.inc which executes cron jobs, sets the cron timeout to 240 seconds using set_time_limit. This should still be overridden by my code’s calls to set_time_limit.

After digging into the workings of FeedAPI I found that that module’s implementation of hook_cron checks after (or before, I forget which) each row of a feed is processed to make sure that the process is not using up more than the allotted percentage of the total cron run time as specified on the FeedAPI administration screen. I’ve got that percentage set to 75%, which works out to 180 seconds. Not enough.
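
The arithmetic behind that limit, and the kind of per-row check described above, can be sketched like this. To be clear, this is not FeedAPI's actual code; the function and parameter names are invented for illustration, and the per-item figure assumes all six feeds of roughly 400 items each were processed in a single cron run.

```php
<?php
/**
 * Seconds a feed import may use, given the cron timeout and the allotted percentage.
 */
function feed_time_budget($cron_timeout, $percent) {
  return $cron_timeout * $percent / 100;
}

/**
 * TRUE while the import is still inside its share of the cron run.
 * $started and $now are timestamps in seconds.
 */
function feed_within_budget($started, $now, $cron_timeout, $percent) {
  return ($now - $started) < feed_time_budget($cron_timeout, $percent);
}

// 75% of the 240 second cron timeout.
echo feed_time_budget(240, 75), "\n"; // prints 180

// Rough time available per item if six feeds of 400 items shared that budget.
echo feed_time_budget(240, 75) / (6 * 400), "\n"; // prints 0.075

var_dump(feed_within_budget(0, 179, 240, 75)); // bool(true)
var_dump(feed_within_budget(0, 180, 240, 75)); // bool(false)
```

Seen that way, it's not hard to see why the import runs out of road: each item may touch up to three nodes, and a small fraction of a second per item is not much to work with.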

The bottom line here is that we’ve got an issue with refreshing our feeds nightly via cron without being restricted by these timeout conditions. Ideally, I’d like to be able to schedule each feed refresh separately, outside of the global Drupal cron job. I’d also like them to run in a multi-threaded fashion.

It seems to me that the Drupal cron functionality needs a bit of work. It’d be nice, in general, to be able to schedule cron jobs for different modules at different times and at different intervals and even give them different timeout limits. The only thing I see out there which may help is this patch for multi-threaded cron jobs. I haven’t tried it yet, but will.

If you have a brilliant idea for how to crack this nut let me know by leaving a comment. Also, if you have any hot stock tips, leave those also.




* set_time_limit: Learn It, Know It, Live It

Posted on February 16th, 2008 by Phil. Filed under PHP, TV Guide, Television.


I’ve got the basic code for importing our TV programs and schedule data working. In a nutshell, PBS provides us an XML feed of schedule data provided by TV Guide for each of our channels. Using FeedAPI and a custom module we’re importing these data and creating various nodes. It basically works as intended! Gotta like that.

More later on the exact implementation details, but I was running into one problem: each time I tried to ingest a feed I was running into the following errors:

Fatal error: Maximum execution time of 30 seconds exceeded in… blah blah blah

Based on my years of application development experience and a very keen gut instinct I quickly surmised that a fatal error is bad. I then dug in to see what was going on here.

As the error message said, the code was timing out; it was taking longer than the maximum execution time as defined by the PHP setting max_execution_time. One possible solution here is to increase this value in the settings.php file. For example, we could double it to 60 seconds via:

ini_set('max_execution_time', '60');

This would increase the maximum execution time for all Drupal processes. Rather than do that, I chose option B, which involves the PHP function set_time_limit. You can call this function in a PHP script and it will restart the timeout counter, effectively increasing the maximum execution time on the fly.

So, I added the following call to a routine in the feed processing code, which gets called each time a record in the feed is processed:

set_time_limit(30);
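
For context, here is roughly what that pattern looks like around a processing loop. This is a stripped-down sketch rather than our feed processor: process_feed_records and the callback are hypothetical stand-ins, and it is written with an anonymous function, so it needs a newer PHP than the one we're building on.

```php
<?php
/**
 * Process feed records one at a time, restarting PHP's timeout counter before each.
 * $process_record is a hypothetical callback standing in for the real node-building code.
 */
function process_feed_records(array $records, $process_record, $seconds_per_record = 30) {
  $processed = 0;
  foreach ($records as $record) {
    // Restart the execution timer, so the limit applies per record
    // rather than to the import as a whole.
    set_time_limit($seconds_per_record);
    call_user_func($process_record, $record);
    $processed++;
  }
  return $processed;
}

// Example run with two dummy records.
$seen = array();
$count = process_feed_records(
  array('record a', 'record b'),
  function ($record) use (&$seen) {
    $seen[] = strtoupper($record);
  }
);
echo $count, "\n"; // prints 2
```

The trade-off is that a single stuck record still gets cut off after 30 seconds, but the import as a whole no longer has an upper bound on how long it can run.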

Voila! Problem solved. Time for a beer.




* Drupal 5 or Drupal 6?

Posted on February 15th, 2008 by Phil. Filed under Drupal.


As we finally get down to coding the new WGBH.org we are all of a sudden faced with another big decision: should we go with Drupal 5 or Drupal 6, which was officially released earlier this week?

At first this seemed like an easy choice. Drupal 5 is the reliable old standard now (relatively speaking), with lots and lots of nifty modules already in place and working. Plus, Drupal 5 is what we’ve been planning on and working with already. Why switch to a brand new release just as we’re finally getting the build off the ground? Doesn’t seem prudent.

But Drupal 6 sure does look sexy.

I just checked out the D6 upgrade status of the modules we’re already planning to use for phase 1 of our rebuild and here’s what I found:

  • CCK - In Progress
  • Date - Unknown
  • FeedAPI - Unknown
  • Image - Done!
  • Image Assist - Unknown
  • Javascript Tools - Unknown
  • Pathauto - Done!
  • Update Status - Now in core!
  • Views - In Progress

If you have any additional or updated information on the status of these modules, let us know!

So, even assuming that Views and CCK are upgraded soon, there are some big holes here, most glaringly FeedAPI, which we’ll use to fetch and parse our TV program and schedule data from PBS. Plus, we will obviously have need for a variety of other modules along the way and who knows what those are and whether they’ll be upgraded any time soon. Clearly, we should just stick with Drupal 5, right?

But, dang, Drupal 6 sure does look sexy.

I mean, it’s got drag and drop administration, lots of theming improvements, more granular permissions and lots of performance and scalability improvements.

Hmmm…

Well, fortunately, I can table the decision for a bit. That’s because, while we’re already writing code, we have yet to build our official development environment. We’re getting a new server for development, on which we will have nice clean new installs of everything. For now, I am building on our old development server which has some issues. I figure once the new dev server is ready we can at that point review the module statuses and make the call then.

Hey, at least it’s nice to have choices!




* Rebuild Phase 1: TV Schedules and Programs

Posted on February 12th, 2008 by Phil. Filed under PBS, Protrack, TV Guide, Television.


Early on in the plans for the rebuild of WGBH.org we made a strategic decision (or, as a certain well-known someone might say, a strategeric decision): to rebuild and relaunch the site in phases. The main reason was that we knew a complete overhaul of the site (i.e. new look and feel, new information architecture, new back end, etc.) was a meaty task that would take some time. However, we had one pressing problem that needed to be addressed more quickly. That problem was the publication of TV schedules and program information.

Ahh, TV schedules. Just the mere mention of them to those involved with getting them on the web site often evokes a quick intake of breath, a wince or - in the more extreme cases - a curse word (or two). To put it mildly, our current process for publishing TV schedules and their related program information is painful. Not only painful, but very time consuming. Not only painful and time consuming but also annoying, frustrating, headache inducing, stomach churning, etc. etc.

I think you get my point.

What is that process, you ask? In a nutshell, here’s how it currently works:

(1) Through a series of complicated and mysterious processes television schedule information - as well as program and episode descriptions - for our various channels makes it into a piece of software called Protrack, which is used by WGBH staff to actually manage what goes on the air. Protrack is designed for use by television programmers, engineers and technicians; the data in it is not meant for general public consumption.

(2) A home grown piece of Java code (which we call the ingestor) runs once an hour to export scheduling and program information from Protrack and import it into our current CMS. Now, bear in mind that the database schemas for Protrack and our CMS were developed independently and for different purposes. This Java code is trying to do the impossible: translate the data in Protrack into our CMS so it is ready for display to the public. There are two real problems here:

(a) The way the data are modeled in each system is very different. One program in Protrack can often have several different titles and versions (e.g. one version for regular airings, one for airings during pledge drives). For our purposes on the web, we only want one version of the program. The ingestor has to try and reconcile these differences programmatically, which is no easy task, due to the nature of the data. In my 5+ years here at WGBH this code has undergone at least two major revisions (i.e. reengineered from the ground up) and, due to the inconsistent and unrigorous nature of the upstream data, it still produces regular errors and needs constant babysitting. The end result is that human intervention is regularly required to clean up errors at ingest time.

(b) The bigger problem is that the data coming from Protrack is not meant for public consumption. Program titles and descriptions in that system can often contain information only meant for internal staff (e.g. “Great show for pledge!”). Or sometimes descriptions just aren’t there. So, all of the data coming in from Protrack needs to be copy edited - or just completely rewritten - by WGBH Online staff. This is very time consuming.

For these reasons we decided early on in the rebuild process that the old way of building schedules had to go ASAFP. After investigating our options (which included talking to other PBS stations and even hiring a consultant) we decided on a new method for publishing TV schedules and programs to WGBH.org. The first thing we needed was a new data source.

Luckily, PBS offers to member stations free XML feeds of TV Guide schedule data. This is a relatively new offering by PBS. It’s one feed for each of our channels, providing airing and descriptive program/episode information two weeks into the future. The main advantage of these data is that the information, being curated by TV Guide, is ready for public consumption. In theory, each feed could be pulled, transformed via XSLT and displayed right on the site as is. The drawback is that the data feed is updated once a day and won’t reflect last minute schedule changes (unlike the Protrack data).

We decided that the savings in editorial effort (not to mention the fact that the feeds are free to us) made this data source the one for us. However, due to the potential last minute schedule changes that wouldn’t be reflected in the data we also decided that some programming muscle would still be required to make these data work for us. So, that has led us to decide on the following new method for publishing TV schedules and program information to WGBH.org:

(1) Import the PBS XML schedules feeds into Drupal. During import create airing, episode and program nodes, from which we can produce a schedule grid, program A-Z list and program and episode description pages. Also, since we can design the database schema in Drupal around the structure of the PBS/TV Guide data, ingestor errors should be reduced.

(2) While the imported PBS data will be published by default, WGBH editorial staff will be able to create or modify schedule and program information as they see fit in our new CMS.

This new process will still involve some heavy lifting, development-wise, but should result in some significant time savings, particularly on the editorial side of things, allowing us to focus on other types of content curation and creation for WGBH.org.

This means that the rebuild of WGBH.org will take place in at least two distinct stages:

Phase 1: Replace the current engine behind TV schedules and program information with the new system. Keep the existing look and feel, site architecture and all other related content and systems.

Phase 2: Complete site redesign and rebuild, retaining the same process for TV programs and schedules. It’s likely that this phase will be divided into smaller phases itself, but that is TBD.

We are currently and actively involved in the development for Phase 1! Next time I will provide more details on the actual implementation of this phase.

Stay warm!




* Getting Down To Business

Posted on February 11th, 2008 by Phil. Filed under Tools.


A lot of people come up to me and ask “What all is required to rebuild a site like WGBH.org using Drupal?” An excellent question! Rather than continuing to repeat my answer over and over, the time has come to write it down. Please take notes.

First, you need specification documents and a project plan. Preferably, both at least 75% complete.

Specification Documents

Next, you need this book: Pro Drupal Development.

Pro Drupal Development

Thirdly, you gotta have a comfy chair.

My office chair

Of course, you must have lots of coffee - black only. Anything less is wimpy.

Black coffee

WGBH is all about accessibility! See? Even our coffee is closed captioned.

Finally, last but most definitely not least in any way, you need … (pregnant pause) … crumb cake!

Crumb cake

Ok, technically, it doesn’t have to be crumb cake. Cookies, pudding and ice cream are also acceptable. Use your best judgment. We trust you.

So, you get the idea. Now let’s write some code!




* Drupalcon Boston

Posted on February 10th, 2008 by Phil. Filed under Drupal, Drupalcon.


Pete and I will be attending Drupalcon right here in Boston, from March 3-6. It’s sure to be chock full of Drupaly goodness!

If you’re planning on going, be sure to leave a comment and let us know so maybe we can meet up!




* In With The New!

Posted on February 8th, 2008 by Phil. Filed under Architecture, CMS, Drupal.


As previously mentioned we’ve decided it’s time to overhaul WGBH.org. New look and feel, new information architecture, new CMS to manage all of the wonderful content we have, bigger offices with comfier chairs for the WGBH Online staff - the whole shebang! Ok - just kidding about the offices and chairs! Gotta make sure you’re awake.

The $64,000 question then is, what’s the plan for building the new system? What technologies are going to replace our tried and true circa 20001 tools? Like all good consumers we gave this a lot of thought, kicked lots of tires, flipped lots of coins and ultimately decided to hitch our wagon to the star that is (drum roll, please) Drupal!

What is Drupal, you ask? In short, it’s a leading open source content management system, with a neat logo to boot.

Drupal Logo
The name Drupal comes from the Dutch word druppel which means drop, hence the logo. More info on the derivation of the name here.

We chose Drupal for a whole bunch of reasons, including (but not limited to):

  • Open Source - Our current CMS is an open source tool and we feel that the whole idea of open source dovetails nicely with the goals of public media, so we wanted to continue using (and supporting) open source.
  • Rich Functionality - Of course, open source tools aren’t worth it if they can’t (reliably) do what you need them to do. Drupal has been around since 2001 (version 6 is about to launch) and is a very mature product, with a rich offering of core functionality and add-on modules that provide much of what we need right out of the box, especially lots of Web 2.0 bells and whistles (e.g. blogs, comments, RSS feeds, etc.)
  • Flexibility - Being an open source tool if there is something which it doesn’t do but which we need we have the ability to modify it or (more importantly) add on to it using Drupal’s various APIs and hooks.
  • Technology - Drupal is written in PHP, a well known scripting language, and uses MySQL, a well known open source relational database. Again, we like the open source thing.
  • Performance and Scalability - Drupal is the engine behind many high profile and high volume web sites, such as The Onion, MTV UK, and Lifetime TV, so it has a proven track record of scalability and performance.
  • Support - Drupal has a strong and active community of developers and users, available to answer questions and help with solutions. There are a number of companies and individuals offering Drupal consulting, development and training services (for example, Lullabot) should the need arise.
  • Public Media Acceptance - A number of other public media companies are either already using Drupal or are in the midst of switching to it (e.g. WETA, NHPR, WXXI), and it is also used by other units within WGBH (e.g. The World, WGBH Lab). Being on a common platform with other public broadcasters and with other groups within our own foundation is very advantageous and should facilitate information sharing and application development and support.

Whew! See, once we considered all that we figured we couldn’t afford not to use Drupal!

In the end we settled on the following (warning: nerd terminology coming) technology stack for the new WGBH.org:

  • CMS - Drupal, which means a code base written in PHP!
  • Hardware & OS - Solaris. Drupal is often run on Linux, but we wanted to take advantage of existing hardware. So, no open source OS. You can’t have it all.
  • Database - MySQL
  • Web Server - Apache
  • Code Control - Subversion

So, it’s Drupal run on SAMP (Solaris, Apache, MySQL, PHP).

In the coming days I’ll start talking in more detail about the build plan and some actual code writin’!





  • Disclaimer

  • The opinions expressed in here are those of the writers/contributors and do not necessarily represent the views or opinions of the WGBH Educational Foundation.