Pass the Aspirin
Posted on October 24th, 2008 by Phil. Filed under Drupal, PHP, Television, Views, tags.
For those of us in the northern hemisphere, fall has arrived! In between raking up and burning piles of leaves (and useless 401(k) statements), we here at WGBH Online have continued to fine-tune our new(ish) TV Programs and Schedules pages.
As you may recall, not long after launch in August, we began to revisit the whole notion of how we’re tagging our TV programs and episodes. The main reason was to improve the way we generate lists of related programs, so as to suggest to visitors other shows they might like. Our initial approach was simple: just tag the programs (not individual episodes) and use a Drupal view to generate a list of up to three related programs.
But this soon proved restrictive. Sure, Frontline is a News and Public Affairs program, but individual shows in the series can be about different things (technology, politics, science). So, we wanted to be able to capture this more detailed level of information and use it to generate more useful lists of related programs for our visitors.
After much thought and discussion (not to mention headaches), we came up with an expanded tagging scheme and more sophisticated program matching logic, which has now been implemented on the site. Here’s what we did:
We renamed our existing TV Program Genre vocabulary to TV Program Primary Genres.
The terms remained the same (a small set of high level classifications) and these are still only applied at the program level.
We then added a new vocabulary that can be applied to both TV programs and episodes: TV Program/Episode Secondary Genres.
This secondary list has many more terms, which allows for a finer level of classification. Tags applied at the program level apply to all episodes in a series. Tags applied at the episode level apply only to that particular episode.
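To make the inheritance rule concrete, here is a minimal sketch in Python (not the Drupal/PHP code running on the site). The class names, field names and the `effective_tags` helper are all made up for illustration; the only thing taken from the scheme above is that program-level tags flow down to every episode, while episode-level tags stay with that episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    # Hypothetical model: primary genres live only at the program level.
    title: str
    primary_genres: frozenset
    secondary_genres: frozenset = frozenset()


@dataclass(frozen=True)
class Episode:
    # Hypothetical model: an episode only carries secondary genres of its own.
    title: str
    program: Program
    secondary_genres: frozenset = frozenset()


def effective_tags(episode):
    """Return every tag that describes an episode.

    Program-level tags (primary and secondary) are inherited by all
    episodes; episode-level tags apply to this episode only.
    """
    inherited = episode.program.primary_genres | episode.program.secondary_genres
    return inherited | episode.secondary_genres


frontline = Program("Frontline", frozenset({"News and Public Affairs"}))
episode = Episode("A hypothetical episode", frontline, frozenset({"Technology"}))
print(sorted(effective_tags(episode)))
```

In this sketch a second Frontline episode tagged only "Politics" would still pick up "News and Public Affairs" from the program, but not "Technology".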
Once we had that in place we then had to think about how, using these tags and given a single program episode, we would define rules for identifying “related” programs and episodes.
This is where the aforementioned headaches started to kick in.
Once you start to think about it, all sorts of questions crop up. Which carries more weight: matching primary genre tags or secondary genre tags (or should they count equally)? Assuming two related programs have the same tags as the target episode, how do we break the tie? Do we match an episode within one series to other episodes in that series, or restrict matches to episodes of other series?
Pass the aspirin, because I’m getting a headache just thinking about it again.
Luckily, we have some fine folks working here who sat down and really noodled through this to come up with some matching logic. When written out, the matching rules looked something like this:
1. Match at the episode level
2. Cull only from upcoming or recently-aired episodes
3. Look for most tag matches, with all tags equally weighted
4. Only allow one episode per program/series to appear in the “You Might Also Like” box
5. In a tie, give priority to episodes with the same “Program Primary” tag
6. If still a tie, give priority to episodes with the exact same tag makeup (i.e. both have only one Primary tag)
7. If still a tie, give priority to the episode with soonest upcoming airing.
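One way to picture rules 3, 5, 6 and 7 is as a single sort key, where each tie-breaker only matters when everything before it is equal. The Python sketch below is an illustration under assumptions, not our production code: the dictionary keys (`tags`, `primary`, `next_airing`) are invented, "same tag makeup" is read as "identical tag sets", and `next_airing` is assumed to already hold the next upcoming airing.

```python
from datetime import datetime


def rank_key(target, candidate):
    """Build a sort key; smaller tuples rank first.

    Rule 3: more shared tags wins (all tags weighted equally).
    Rule 5: then candidates sharing a Program Primary tag.
    Rule 6: then candidates whose tag set equals the target's (assumed reading).
    Rule 7: then the soonest upcoming airing.
    """
    shared = len(target["tags"] & candidate["tags"])
    same_primary = bool(target["primary"] & candidate["primary"])
    same_makeup = target["tags"] == candidate["tags"]
    # False sorts before True, so negate the "good" flags.
    return (-shared, not same_primary, not same_makeup, candidate["next_airing"])


def rank(target, candidates):
    """Order candidate episodes from best match to worst."""
    return sorted(candidates, key=lambda c: rank_key(target, c))


# Hypothetical data, for illustration only.
target = {"tags": {"news", "tech"}, "primary": {"news"}}
candidates = [
    {"title": "A", "tags": {"tech"}, "primary": {"science"},
     "next_airing": datetime(2008, 10, 25, 20)},
    {"title": "B", "tags": {"news", "tech"}, "primary": {"news"},
     "next_airing": datetime(2008, 10, 27, 21)},
    {"title": "C", "tags": {"news", "politics"}, "primary": {"news"},
     "next_airing": datetime(2008, 10, 26, 21)},
]
print([c["title"] for c in rank(target, candidates)])  # ['B', 'C', 'A']
```

B wins on tag count alone; A and C tie on one shared tag each, so the shared primary tag puts C ahead of A even though A airs sooner.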
The idea was then to use the tags and these rules to generate up to three matches for each episode to display in the “You Might Also Like” block in the right-hand rail.
Well, up to three matches, unless there are more than three episodes with the exact same tag structure as the target episode. In that case, we display up to five such matches.
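The three-or-five rule is easy to misread, so here is a small Python sketch of one interpretation. It assumes the matches arrive as a list of `(title, tags)` pairs already ordered by the rules above; the function name and that input shape are hypothetical.

```python
def pick_for_display(target_tags, ranked):
    """Choose which ranked matches to show.

    ranked: list of (title, tags) pairs, best match first (assumed shape).
    If more than three matches have exactly the target's tags, show up to
    five of those; otherwise show the top three matches overall.
    """
    exact = [item for item in ranked if item[1] == target_tags]
    if len(exact) > 3:
        return exact[:5]
    return ranked[:3]
```

So a target with four exact-tag matches gets four entries, one with six gets five, and everything else is capped at three.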
No sweat!
In order to actually implement this, we could no longer just spit out the results from a view. Nope. Instead, we had to jump through a whole bunch of hoops. Here’s the thumbnail sketch of the implementation:
1. Given the tags for a target episode, query a view of TV programs, fetching all programs that match at least one Program Primary or Secondary tag.
2. Filter this list of programs, including only programs with an airing in our schedule data window (one week back, two weeks ahead).
3. Then count the exact number of tag matches and calculate a matching score for each program, based on the above rules. Then store the program in an array.
NOTE: I won’t go into the exact matching score formula here. Suffice it to say we came up with a formula that encapsulates the above matching and ordering rules. Please pass the aspirin again…
4. Next, query a view of TV episodes, fetching all episodes that match at least one Episode Secondary genre of the episode in question.
5. Filter this list of episodes, including only those with an airing in our schedule data window. For each one count the exact number of matching genre tags for the episode and calculate the matching score. See if the episode’s parent program is already in the array of matching programs. If so, replace it with this episode if the matching score is higher.
6. Given the final array of matching episodes, reorder the array by the matching scores and display the top three (or five) entries!
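For the curious, steps 2, 5 and 6 boil down to something like the Python sketch below. This is not the PHP that runs on the site: the views queries are replaced by plain lists, the matching score is taken as a precomputed number (the formula stays under wraps), and the dictionary keys are invented. It also assumes that an episode whose parent program is not yet in the array simply gets added, which the steps above don't spell out.

```python
from datetime import datetime, timedelta


def in_window(airing, now):
    """True if the airing falls in the schedule window: one week back, two weeks ahead."""
    return now - timedelta(weeks=1) <= airing <= now + timedelta(weeks=2)


def best_per_program(programs, episodes, now):
    """Keep one entry per program, preferring the higher score, then sort.

    programs, episodes: lists of dicts with hypothetical keys
    'program', 'title', 'score' and 'airing'.
    """
    best = {}
    # Programs go in first, so a matching episode can replace its parent.
    for item in list(programs) + list(episodes):
        if not in_window(item["airing"], now):
            continue  # steps 2 and 5: drop anything outside the window
        current = best.get(item["program"])
        if current is None or item["score"] > current["score"]:
            best[item["program"]] = item
    # Step 6: highest matching score first.
    return sorted(best.values(), key=lambda m: m["score"], reverse=True)


# Hypothetical data, for illustration only.
now = datetime(2008, 10, 24)
programs = [
    {"program": "P1", "title": "P1", "score": 2, "airing": datetime(2008, 10, 25)},
    {"program": "P2", "title": "P2", "score": 3, "airing": datetime(2008, 12, 1)},
]
episodes = [
    {"program": "P1", "title": "P1 episode", "score": 4, "airing": datetime(2008, 10, 20)},
    {"program": "P3", "title": "P3 episode", "score": 1, "airing": datetime(2008, 11, 1)},
]
print([m["title"] for m in best_per_program(programs, episodes, now)])
```

Here P2 drops out because its airing is outside the window, and the P1 episode bumps its parent program because it scores higher.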
The resulting PHP code to implement all of this ran to about 240 lines and looked a little something like this:
All that just to generate this on the front end:
Anybody know the limit on the number of aspirin you can take in one day?