Archive for the ‘PHP’ Category
* Pass the Aspirin
Posted on October 24th, 2008 by Phil. Filed under Drupal, PHP, Television, Views, tags.
For those of us in the northern hemisphere, fall has arrived! In between raking up and burning piles of leaves (and useless 401(k) statements), we here at WGBH Online have continued to fine-tune our new(ish) TV Programs and Schedules pages.
As you may recall, not long after launch in August, we began to revisit the whole notion of how we’re tagging our TV programs and episodes. The main reason was to improve the way we generate lists of related programs, so as to suggest to visitors other shows they might like. Our initial approach was simple: just tag the programs (not individual episodes) and use a Drupal view to generate a list of up to three related programs.
But this soon proved restrictive. Sure, Frontline is a News and Public Affairs program, but individual shows in the series can be about different things (technology, politics, science). So, we wanted to be able to capture this more detailed level of information and use it to generate more useful lists of related programs for our visitors.
After much thought and discussion (not to mention headaches), we came up with an expanded tagging scheme and more sophisticated program matching logic, which has now been implemented on the site. Here’s what we did:
We renamed our existing TV Program Genre vocabulary to TV Program Primary Genres.
The terms remained the same (a small set of high level classifications) and these are still only applied at the program level.
We then added a new vocabulary that can be applied to both TV programs and episodes: TV Program/Episode Secondary Genres.
This secondary list has many more terms, which allow for a more sophisticated level of classification. Tags applied at the program level apply to all episodes in a series. Tags applied at the episode level apply only to that particular episode.
Once we had that in place we then had to think about how, using these tags and given a single program episode, we would define rules for identifying “related” programs and episodes.
This is where the aforementioned headaches started to kick in.
Once you started to think about it, all sorts of questions cropped up, like, which carries more weight, matching primary genre tags or secondary genre tags (or should they count equally)? Or, assuming two related programs have the same tags as the target episode, how to break the tie? Or, do we match an episode within one series to other episodes in that series or restrict it to episodes of other series?
Pass the aspirin, because I’m getting a headache just thinking about it again.
Luckily, we have some fine folks working here who sat down and really noodled through this to come up with some matching logic. When written out, the matching rules looked something like this:
1. Match at the episode level
2. Cull only from upcoming or recently-aired episodes
3. Look for most tag matches, with all tags equally weighted
4. Only allow one episode per program/series to appear in the “You Might Also Like” box
5. In a tie, give priority to episodes with same “Program Primary” tag
6. If still a tie, give priority to episodes with exact same tag makeup (i.e. both have only one Primary tag)
7. If still a tie, give priority to the episode with soonest upcoming airing.
The idea was then to use the tags and these rules to generate up to three matches for each episode to display in the “You might also like” block in the right hand rail.
Well, up to three matches, unless there were more than three episodes with the exact same tag structure as the target episode. In that case, we will display up to five such matches.
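For illustration only, the ordering rules above can be sketched as a sort comparator. This is not the production code: the field names (`matches`, `same_primary`, `same_makeup`, `next_airing`) and the function names are assumptions, each candidate is assumed to have already passed rule 2, and the "up to five" case is read here as "show only exact matches when more than three exist".

```php
<?php
// Sketch only: every candidate is assumed to carry precomputed, hypothetical fields.

// Comparator for usort(): returns -1 when $a should be listed before $b.
function related_compare($a, $b) {
  // Rule 3: more matching tags wins, all tags weighted equally.
  if ($a['matches'] != $b['matches']) {
    return ($a['matches'] > $b['matches']) ? -1 : 1;
  }
  // Rule 5: prefer the candidate sharing the target's Program Primary tag.
  if ($a['same_primary'] != $b['same_primary']) {
    return $a['same_primary'] ? -1 : 1;
  }
  // Rule 6: prefer the candidate with the exact same tag makeup.
  if ($a['same_makeup'] != $b['same_makeup']) {
    return $a['same_makeup'] ? -1 : 1;
  }
  // Rule 7: prefer the soonest upcoming airing (Unix timestamps).
  if ($a['next_airing'] != $b['next_airing']) {
    return ($a['next_airing'] < $b['next_airing']) ? -1 : 1;
  }
  return 0;
}

function related_pick($candidates) {
  usort($candidates, 'related_compare');
  // Rule 4: keep only the best-ranked episode per program/series.
  $seen = array();
  $unique = array();
  foreach ($candidates as $c) {
    if (!isset($seen[$c['program']])) {
      $seen[$c['program']] = TRUE;
      $unique[] = $c;
    }
  }
  // More than three exact matches: show up to five of those instead of three.
  $exact = array();
  foreach ($unique as $c) {
    if ($c['same_makeup']) {
      $exact[] = $c;
    }
  }
  if (count($exact) > 3) {
    return array_slice($exact, 0, 5);
  }
  return array_slice($unique, 0, 3);
}

// Made-up candidates for a single target episode.
$picked = related_pick(array(
  array('title' => 'Episode A', 'program' => 'Series 1', 'matches' => 2,
        'same_primary' => FALSE, 'same_makeup' => FALSE, 'next_airing' => 300),
  array('title' => 'Episode B', 'program' => 'Series 2', 'matches' => 2,
        'same_primary' => TRUE, 'same_makeup' => FALSE, 'next_airing' => 500),
  array('title' => 'Episode C', 'program' => 'Series 2', 'matches' => 1,
        'same_primary' => TRUE, 'same_makeup' => FALSE, 'next_airing' => 100),
  array('title' => 'Episode D', 'program' => 'Series 3', 'matches' => 3,
        'same_primary' => FALSE, 'same_makeup' => FALSE, 'next_airing' => 900),
));
```

With this sample data, Episode C is dropped because Episode B already represents Series 2, and the remaining three are ordered D, B, A.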
No sweat!
In order to actually implement this, we could no longer just spit out the results from a view. Nope. Instead, we had to jump through a whole bunch of hoops. Here’s the thumbnail sketch of the implementation:
1. Given the tags for a target episode, query a view of TV programs, fetching all programs that match at least one Program Primary or Secondary tag.
2. Filter this list of programs, including only programs with an airing in our schedule data window (one week back, two weeks ahead).
3. Then count the exact number of tag matches and calculate a matching score for each program, based on the above rules. Then store the program in an array.
NOTE: I won’t go into the exact matching score formula here. Suffice it to say we came up with a formula that encapsulates the above matching and ordering rules. Please pass the aspirin again…
4. Next query a view of TV episodes, fetching all episodes that match at least one Episode Secondary genre of the episode in question.
5. Filter this list of episodes, including only those with an airing in our schedule data window. For each one count the exact number of matching genre tags for the episode and calculate the matching score. See if the episode’s parent program is already in the array of matching programs. If so, replace it with this episode if the matching score is higher.
6. Given the final array of matching episodes, reorder the array by the matching scores and display the top three (or five) entries!
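A stripped-down sketch of steps 2, 5 and 6 might look like the following. It treats the matching score as an opaque number supplied by the caller, since the actual formula isn't given here. The array keys and function names are assumptions, as is the choice to add an episode whose parent program isn't in the array yet.

```php
<?php
// Sketch only: scores are inputs here, not computed.

// Steps 2 and 5: is the airing inside the schedule data window
// (one week back, two weeks ahead of $now)? Times are Unix timestamps.
function in_schedule_window($airing, $now) {
  $week = 7 * 24 * 60 * 60;
  return $airing >= $now - $week && $airing <= $now + 2 * $week;
}

// Sort helper: highest score first.
function by_score_desc($a, $b) {
  if ($a['score'] == $b['score']) {
    return 0;
  }
  return ($a['score'] > $b['score']) ? -1 : 1;
}

function merge_matches($programs, $episodes, $now, $limit) {
  // One slot per program, keyed by a hypothetical program id.
  $best = array();
  foreach ($programs as $p) {
    if (in_schedule_window($p['airing'], $now)) {
      $best[$p['program']] = $p;
    }
  }
  foreach ($episodes as $e) {
    if (!in_schedule_window($e['airing'], $now)) {
      continue;
    }
    $key = $e['program'];
    // Step 5: an episode replaces its parent program only if it scores higher.
    // Assumption: an episode with no parent entry yet is simply added.
    if (!isset($best[$key]) || $e['score'] > $best[$key]['score']) {
      $best[$key] = $e;
    }
  }
  // Step 6: reorder by score and keep the top entries.
  $best = array_values($best);
  usort($best, 'by_score_desc');
  return array_slice($best, 0, $limit);
}

$now = 2000000;
$day = 24 * 60 * 60;
$programs = array(
  array('program' => 'p1', 'title' => 'Program 1', 'score' => 5, 'airing' => $now + 3600),
  array('program' => 'p2', 'title' => 'Program 2', 'score' => 3, 'airing' => $now - 8 * $day),
  array('program' => 'p3', 'title' => 'Program 3', 'score' => 4, 'airing' => $now + $day),
);
$episodes = array(
  array('program' => 'p1', 'title' => 'Episode 1a', 'score' => 7, 'airing' => $now + 7200),
  array('program' => 'p3', 'title' => 'Episode 3a', 'score' => 2, 'airing' => $now + 100),
  array('program' => 'p4', 'title' => 'Episode 4a', 'score' => 6, 'airing' => $now + 20 * $day),
);
$top = merge_matches($programs, $episodes, $now, 3);
```

In this sample, Program 2 and Episode 4a fall outside the window, Episode 1a displaces Program 1, and Episode 3a scores too low to displace Program 3.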
The resulting PHP code to implement all of this ran to about 240 lines and looked a little something like this:
All that just to generate this on the front end:
Anybody know the limit on the number of aspirin you can take in one day?
* How Do You Spell Relief? M-O-S-H-E!
Posted on May 21st, 2008 by Phil. Filed under Devel, Drupal, FeedAPI, PHP, Pathauto, Television, Token.
As you regular readers know, we’ve been struggling with a little memory problem here as of late. Basically, under several different circumstances, PHP would quit and tell us it had used up all of its allocated memory, which we had jacked all the way up to 512MB.
At first, I suspected it was due to some server configuration issues, since we had just built an entirely new environment from the ground up. After a while we figured out it was definitely a code issue, and possibly more than one. After examining our own code, poking around contributed module code and some trial and error, I hadn’t been able to pinpoint the problem. Things were starting to look … worrisome.
But, as they say, it’s always darkest before the dawn. The other day, what to my wondering eyes should appear but a comment on this blog from Moshe Weitzman. Moshe is one of the original Drupal developers and remains one of the key core and contributed module maintainers to this day. Apparently, he is also a local resident and WGBH fan.
He’s also a very nice guy.
Moshe had discovered this blog and offered up his considerable help. Right away he fixed a memory leak in the Devel module and released a new version, which solved our problem of sometimes running out of memory when invoking the theme editor.
But he wasn’t done helping us with just that tidbit. Oh me oh my no.
I gave him a rundown on our problem of running out of memory when doing our nightly TV schedule data import and he was able to quickly suggest some possible culprits. Sure enough, after some tinkering around, I found that the problem went away when I disabled the Pathauto and Token modules. I tinkered with the FeedAPI cron function to disable these modules at the start and reenable them at the end of the process. As a result, memory usage during the import of our schedule data dropped tenfold.
Whew!
This fix, however, did introduce one new wrinkle: we used the Pathauto module to set URL aliases for new TV program and episode nodes that are created during the nightly import. By disabling Pathauto, I then had to write my own bit of code to set these aliases during import. Not a huge deal and, really, quite a small price to pay.
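As a rough idea of what such alias code involves, here is a minimal sketch that builds an alias string from a node title. The `tv/programs` prefix, the function names and the slug rules are all assumptions, not the actual WGBH scheme. Inside Drupal 5, the resulting string would then be handed to `path_set_alias()` along with the node's `node/NID` path.

```php
<?php
// Sketch only: hypothetical helpers, not Pathauto's rules.

// Lowercase the title and collapse every run of other characters into a hyphen.
function alias_slug($title) {
  $slug = strtolower(trim($title));
  $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
  return trim($slug, '-');
}

// Prefix the slug with a per-content-type path (the prefix is an assumption).
function node_alias($prefix, $title) {
  return $prefix . '/' . alias_slug($title);
}

$alias = node_alias('tv/programs', 'Sample Program: Part 2');
```

Note that this simple version turns any non-ASCII character into a hyphen and does nothing to guarantee uniqueness, both of which a real import would need to handle.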
The bottom line of all this is that I am now sleeping just a little bit better each night and we’ve been able to ratchet down the maximum amount of memory assigned to PHP from 512MB to 128MB.
Thanks again, Moshe!
* It’s Alive!
Posted on May 16th, 2008 by Phil. Filed under Drupal, PHP, Television, Testing.
This week brought good news and bad news. Actually, more like very good news and pretty annoying news. First, the very good news:
We have officially posted some pages for internal testing! Behold…
[Screenshot: TV Programs A-Z]
Nothing too fancy here. On our current site the A-Z list is a popup. But now it’s a full-grown page! One note: the Search TV Programs form does not yet work, on this page or any other.
[Screenshot: TV Schedule Grid]
Now we’re talking! The grid is the big magilla of this project. It finally allows us to display schedule information for all of our channels at once. Basically, we’re finally catching up to the rest of the world.
As you can see, the grid displays schedule information in three hour blocks. The user can navigate forward or backward or jump to a specific block of time using the Pick a Time form at the top. There is also the calendar selector, which lets users view the schedules for a given day. Note how the calendar highlights the current day (or the day of the schedule that you’re looking at), as well as the schedule data window: the period for which we display schedule data, which as of now is one week back and two weeks forward (almost).
We still need to play around with limiting the number of characters in the program or episode title that we display on the grid. There’s always something…
[Screenshot: Full Day TV Schedules by Channel]
As you can see, the full day schedule shares the calendar selector with the grid, and replaces the Pick a Time selector with a Pick a Channel form. Nice!
Now we can proceed with some preliminary testing, while Pete and I get to work on the program/series, episode, search and other pages.
OK, on to the pretty annoying news. This…
[Screenshot: PHP out-of-memory error]
…is still happening.
Under a couple of different scenarios, the underlying PHP process uses up its allotted memory and then - like one of my kids - holds its breath and refuses to continue until it gets what it wants (more memory!). The above error was generated simply by trying to enable the theme developer. It can also happen during our nightly schedule data ingest, though our current allocation of 512MB is enough to prevent this, thank goodness.
So far all I’ve been able to confirm is that it’s not a server configuration issue. It seems to be a code leak and there may be more than one culprit out there. It’s starting to give me real headaches and needs to be resolved in the not too distant future.
However, for now, I refuse to let it ruin my weekend!
* Deleting is Fun!
Posted on February 24th, 2008 by Phil. Filed under Devel, MySQL, PHP, SQL, Tools.
I have come up with a solution for deleting large numbers of nodes that bypasses the normal node_delete PHP function, which can be quite slow. Below is a description of what I did, which is not necessarily for the faint of heart. Use this method at your own risk…
The Problem
As I described in our last post, I’ve now got a nice working version of our TV programs and schedules ingestor. In the course of development of this code we’ve previously ingested lots of TV program/episode/airing records creating thousands of nodes. Now that that code is in a somewhat complete and clean state, I wanted to delete all of the old content and start fresh, so we can proceed with the front end development.
The problem was I had well over 30,000 nodes, representing three different content types, that needed to be deleted. The usual method of deleting content in Drupal - the node_delete function - proved to be about as slow as my morning commute into Boston, and so was all but unusable in this situation.
What I needed was a way to delete all nodes of a given content type using SQL statements, rather than the standard Drupal/PHP functionality.
The Solution
Deleting records from the node table by content type is obviously pretty easy. The tricky part is that there can be all sorts of related database records in any number of tables generated by various modules that also need to be deleted. So, I needed an easy way to identify all database records for a node of a given content type.
The solution was to use the Devel module to identify everything that needs to be deleted. First, with this module installed and enabled I turned on the Collect query info and Display query log options. With these in place, each page called will display a list of the queries executed (along with number of executions and execution times) at the bottom.
Next, I created a page in Drupal that would delete a single record of the content type in question. The body of the page was a chunk of PHP code to use node_delete to delete a record, like so:
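// Fetch a single node of the given type and delete it the normal Drupal way,
// so that Devel logs every query that node_delete triggers.
$res = db_query("SELECT n.nid FROM {node} n WHERE n.type = 'CONTENT_TYPE' LIMIT 1");
while ($n = db_fetch_object($res)) {
  node_delete($n->nid);
}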
In the above, CONTENT_TYPE gets replaced with the actual content type in question. You also need to set the input format to PHP code, so it’ll get executed, rather than just displayed.
Then, view this page in the browser. This will cause one node to be deleted and, thanks to the Devel module, all of the SQL statements executed during page generation to be listed at the bottom of the page. Among the list of SELECT statements executed by the page, you’ll also see a bunch of DELETE statements, like for example:
DELETE FROM node_revisions WHERE nid = 1234;
Once I gathered up all of these delete statements, I then rewrote each one to delete not just one record but all records related to a specific content type. The above DELETE statement got rewritten as:
DELETE t.* FROM node_revisions t JOIN node n ON (t.nid = n.nid) WHERE n.type = 'CONTENT_TYPE';
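The rewrite itself is mechanical, so it could be scripted. Below is a minimal sketch, assuming every logged statement has the simple `DELETE FROM table WHERE nid = N` shape shown above; statements keyed on some other column would still need rewriting by hand. The function names and the `tv_airing` content type are made up for the example. One thing the sketch makes explicit is ordering: because the rewritten statements join against the node table, the delete on node itself has to run last.

```php
<?php
// Sketch only: turns single-node DELETE statements from the query log
// into per-content-type bulk deletes.

function bulk_delete_sql($logged, $type) {
  // Only accept a plain identifier as the content type, since it goes into SQL.
  if (!preg_match('/^\w+$/', $type)) {
    return NULL;
  }
  // Only handle the simple "DELETE FROM table WHERE nid = N" shape.
  if (!preg_match('/^DELETE FROM (\w+) WHERE nid = \d+;?$/i', trim($logged), $m)) {
    return NULL;
  }
  $table = $m[1];
  if ($table == 'node') {
    // The node table carries the type column itself, so no join is needed.
    return "DELETE FROM node WHERE type = '$type';";
  }
  return "DELETE t.* FROM $table t JOIN node n ON (t.nid = n.nid) WHERE n.type = '$type';";
}

function bulk_delete_script($logged_statements, $type) {
  $joined = array();
  $node = array();
  foreach ($logged_statements as $stmt) {
    $sql = bulk_delete_sql($stmt, $type);
    if ($sql === NULL) {
      continue;  // Not a statement this sketch knows how to rewrite.
    }
    if (strpos($sql, 'DELETE FROM node ') === 0) {
      $node[] = $sql;
    }
    else {
      $joined[] = $sql;
    }
  }
  // Joined deletes first, the node table last.
  return implode("\n", array_merge($joined, $node));
}

$script = bulk_delete_script(array(
  'DELETE FROM node WHERE nid = 1234',
  'SELECT n.nid FROM node n WHERE n.nid = 1234',
  'DELETE FROM node_revisions WHERE nid = 1234;',
), 'tv_airing');
```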
I took all of these rewritten DELETE statements, put them into a file and executed the file using the MySQL command line tool, via the command:
mysql> source delete_all_airings.sql;
Voila! I repeated this for each content type that I wanted to clean out. All told, using this method I was able to delete all those nodes in under an hour.
Obviously, I don’t plan on doing this very often. But it sure was fun!
* One Problem Solved, New Problem Found
Posted on February 22nd, 2008 by Phil. Filed under Drupal, PHP, SQL, cron.
Not like I needed it, but the impending snow storm is again proving that winter stinks (to put it in family-friendly terms). I blame that oversized rat in Pennsylvania.
I’ve (temporarily) solved the timeout problem I’ve been having with importing TV schedule feeds using FeedAPI. Ideally, due to the time it takes to run them, I’d like to be able to schedule these feed imports separately from the rest of the Drupal cron jobs. The multi-threaded cron patch looked promising, but it was written for Drupal 5.1 and we’re currently building on 5.7. At this point I don’t want to get too deep into engineering some custom cron functionality. Instead, I’ll settle for a shortcut to allow my feeds to be imported nightly via the normal Drupal cron routine without timing out.
After digging a little deeper into the FeedAPI code I found that its implementation of hook_cron calls the function feedapi_cron_time before processing each row in a feed. That function basically checks that the total run time of the feed import hasn’t exceeded the allotted percentage of total cron run time. For now, I’ve decided to simply have that function always return true, so a timeout won’t happen.
Problem solved - for now. Hopefully I can revisit this later and find a better solution, perhaps once multi-threaded cron is more fully fleshed out and available.
Now that the entire feed import process is working I have a new problem.
I’d like to clean out the program and schedule data that I’ve accumulated in our development database and start importing data using the new code. This is actually quite a bit of data, as it includes data imported by an earlier version of the feed ingestor that we wrote last year. In fact, as of right now we have:
- 539 programs
- 4,926 episodes
- 27,731 airings
This is far too much data to delete using the clunky delete content functionality (i.e. the node_delete function) in Drupal. That functionality is also pretty damned slow, as it has to churn through lots of PHP code and functions for each node to delete.
We’ve got a custom module for the TV schedules data that implements hook_nodeapi. If an episode gets deleted, this implementation makes sure all child airings are deleted. Likewise, if a program gets deleted, all child episodes (and hence all of their airings) get deleted. This ensures we should never have orphaned airings or episodes.
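To make the cascade concrete, here is a small in-memory sketch of the same idea. It is not the hook_nodeapi implementation: nodes are plain arrays, `parent` stands in for the node reference to the parent, and the function name is made up. It does show why deleting one program touches so many nodes.

```php
<?php
// Sketch only: $nodes maps nid => array('type' => ..., 'parent' => nid or NULL).

// Delete a node and, recursively, everything that references it as parent.
// Returns the number of nodes removed.
function cascade_delete(&$nodes, $nid) {
  $deleted = 0;
  // Iterate over a snapshot of the keys, since entries vanish as we recurse.
  foreach (array_keys($nodes) as $child_nid) {
    if (isset($nodes[$child_nid]) && $nodes[$child_nid]['parent'] === $nid) {
      $deleted += cascade_delete($nodes, $child_nid);
    }
  }
  unset($nodes[$nid]);
  return $deleted + 1;
}

$nodes = array(
  1 => array('type' => 'program', 'parent' => NULL),
  2 => array('type' => 'episode', 'parent' => 1),
  3 => array('type' => 'episode', 'parent' => 1),
  4 => array('type' => 'airing', 'parent' => 2),
  5 => array('type' => 'airing', 'parent' => 2),
  6 => array('type' => 'airing', 'parent' => 3),
  7 => array('type' => 'program', 'parent' => NULL),
  8 => array('type' => 'episode', 'parent' => 7),
);
$removed = cascade_delete($nodes, 1);
```

Deleting program 1 here removes six nodes in total: the program, its two episodes and their three airings. In Drupal, each of those removals is a full node_delete call.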
Last night I tried to delete 10 programs using the standard node_delete functionality and it took at least 30 minutes.
Obviously, deleting all program and schedule data isn’t something we’ll be doing often. But I do need to do it at least once and I may need to do it again at some point during the build and I don’t want to wait several days for it to get done. So I need a faster solution for deleting large amounts of data.
I did find this delete all module, which claimed to let you delete content by content type and to choose either the standard (and safest) node_delete option or a faster (and riskier) bulk delete using just SQL statements. However, when I installed it I found that all it offers is the option to delete all content, and only by using the usual PHP functions.
Soooo, I’m about to dig in and see if I can write my own SQL statements to delete this content cleanly and quickly.
In between rounds of shoveling my driveway, of course…
* Not So Fast, Mr. Smarty Pants
Posted on February 20th, 2008 by Phil. Filed under Drupal, PHP, Television, cron.
OK, that last post may have been premature.
I’m still having a timeout problem with the TV schedules import routine. Before I get into the latest issue, let’s take a step back and review what we’re doing here.
As I’ve said before, we’re importing TV program and schedule data from PBS via XML. These data are generated by TV Guide and provided to us as RSS feeds, one for each of our channels, with each daily file containing schedule data two weeks into the future.
In order to support the publication of a schedule grid, daily channel schedules and program/series/episode pages, we’re using CCK to create the following custom content types:
- TV Channel
- TV Program
- TV Episode - Related to one program via node reference
- TV Airing - Related to one episode and one channel via node references
We’re coding a routine to import the program and schedule XML data and create program/episode/airing nodes. The import is done using FeedAPI, the SimplePie parser and a custom feed processor of our own construction. We were originally thinking of using the FeedAPI Node module to process the data and create the nodes but didn’t because (1) we need to create or update up to three different nodes for each line of a feed (program/episode/airing nodes) and (2) the logic to match feed content to existing nodes is, due to the nature of the PBS data, quite messy.
We’ve got the import, parsing and processing code working. However, as I mentioned in my previous post, we’re running into issues with process timeouts during the feed refresh. There’s quite a bit of processing going on for each line in a schedule feed. Given that and the fact that each feed has 400+ items and that we’re working with six schedule feeds (one for each of our channels), the nightly feed refresh process is not quick. At this point, no real application or database tuning has occurred, which is part of the problem, but I think that even after everything is tuned to the hilt for performance we’ll still have this issue.
I first ran into this last week when I tried to refresh a feed using the Refresh link on the Feed Administration page. As I said earlier, I realized that this process was restricted by the max_execution_time PHP setting, which defaulted to 30 seconds. I was able to get around this by using the set_time_limit function to restart the timer as each row in the feed was processed. That worked great for refreshes initiated on the Feed Administration screen.
However, things broke down again when I tried to have the feeds refresh via cron. I found that drupal_cron_run, the function in includes/common.inc which executes cron jobs, sets the cron timeout to 240 seconds using set_time_limit. This should still be overridden by my code’s calls to set_time_limit.
After digging into the workings of FeedAPI I found that that module’s implementation of hook_cron checks after (or before, I forget which) each row of a feed is processed to make sure that the process is not using up more than the allotted percentage of the total cron run time as specified on the FeedAPI administration screen. I’ve got that percentage set to 75%, which works out to 180 seconds. Not enough.
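The arithmetic behind that 180 seconds, and the kind of check being described, can be sketched as follows. This is an illustration of the idea, not FeedAPI's actual code; the function names and parameters are assumptions.

```php
<?php
// Sketch only: a time budget expressed as a percentage of the cron limit.

// Seconds a feed import may use, given the cron limit and the allotted percentage.
function cron_budget($cron_limit, $percent) {
  return (int) floor($cron_limit * $percent / 100);
}

// TRUE while the import that began at $started is still inside its budget.
function within_budget($started, $now, $cron_limit, $percent) {
  return ($now - $started) < cron_budget($cron_limit, $percent);
}

// 75% of the 240-second cron limit.
$budget = cron_budget(240, 75);
```

A check like this, run once per feed row, stops the import as soon as the elapsed time reaches the budget, no matter how often the PHP timer itself has been restarted.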
The bottom line here is that we’ve got an issue with refreshing our feeds nightly via cron without being restricted by these timeout conditions. Ideally, I’d like to be able to schedule each feed refresh separately, outside of the global Drupal cron job. I’d also like them to run in a multi-threaded fashion.
It seems to me that the Drupal cron functionality needs a bit of work. It’d be nice, in general, to be able to schedule cron jobs for different modules at different times and at different intervals and even give them different timeout limits. The only thing I see out there which may help is this patch for multi-threaded cron jobs. I haven’t tried it yet, but will.
If you have a brilliant idea for how to crack this nut let me know by leaving a comment. Also, if you have any hot stock tips, leave those also.
* set_time_limit: Learn It, Know It, Live It
Posted on February 16th, 2008 by Phil. Filed under PHP, TV Guide, Television.
I’ve got the basic code for importing our TV programs and schedule data working. In a nutshell, PBS provides us an XML feed of schedule data provided by TV Guide for each of our channels. Using FeedAPI and a custom module we’re importing these data and creating various nodes. It basically works as intended! Gotta like that.
More later on the exact implementation details, but I did run into one problem: each time I tried to ingest a feed, I got the following error:
Fatal error: Maximum execution time of 30 seconds exceeded in… blah blah blah
Based on my years of application development experience and a very keen gut instinct I quickly surmised that a fatal error is bad. I then dug in to see what was going on here.
As the error message said, the code was timing out; it was taking longer than the maximum execution time as defined by the PHP setting max_execution_time. One possible solution here is to increase this value in the settings.php file. For example, we could double it to 60 seconds via:
ini_set('max_execution_time', '60');
This would increase the maximum execution time for all Drupal processes. Rather than do that, I chose option B, which involves the PHP function set_time_limit. You can call this function in a PHP script and it will restart the timeout counter, effectively increasing the maximum execution time on the fly.
So, I added the following call to a routine in the feed processing code, which gets called each time a record in the feed is processed:
set_time_limit(30);
Voila! Problem solved. Time for a beer.
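To show where such a call sits, here is a minimal sketch of a per-record loop. The function name and the placeholder work are assumptions; only `set_time_limit` is the real PHP function. Each call restarts the timeout counter from zero, so the limit applies to each record rather than to the whole feed.

```php
<?php
// Sketch only: a hypothetical feed-processing loop.

function process_feed_items($items, $seconds_per_item) {
  $processed = 0;
  foreach ($items as $item) {
    // Restart the timeout counter before handling each record.
    set_time_limit($seconds_per_item);
    // ... the real per-record work (creating or updating nodes) would go here ...
    $processed++;
  }
  return $processed;
}

$count = process_feed_items(array('item 1', 'item 2', 'item 3'), 30);
```

The trade-off is that the script as a whole no longer has an upper bound on its run time, only each individual record does.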
Disclaimer
- The opinions expressed in here are those of the writers/contributors and do not necessarily represent the views or opinions of the WGBH Educational Foundation.














