Keep on Searchin’

Visitors to WGBH.org often come to the site seeking out information about a specific TV program. Maybe they enjoy Frontline, and they want to find upcoming episodes. Or they caught the last 5 minutes of a show about hot dogs, but they can’t remember the name of the program and they want to know if it will be airing again.

So one of the requirements of the TV Programs and Schedules module was that we implement a scoped search: an advanced search page that would only return a list of TV episodes in the results.

[Image: TV Search - NOVA]

In the past, I’ve used some interesting modules that modify or expand upon the core Drupal search. The Views Fast Search module offers the flexibility to define the content you perform a search on, which really enhances the searching capabilities. Unfortunately, the module is only available for Drupal 5 (although parts of Views Fast Search made it into Drupal 6). Drupal’s built-in advanced search form is also capable of limiting a search query to specific content types — there are several different approaches to achieving this. And the Restricted Search module allows administrators to exclude content types from the search index entirely.

But simply blocking other content types from the query won’t quite cut it for several reasons, and excluding content from the search index would not be a good long-term solution, because eventually we will need to make use of the full site search in addition to this scoped search. Also, just to spice things up a bit, the additional criteria for the TV search specified that:

  • The search should only return TV Episode nodes. The Airing and Program nodes do not show up in the search results, although they do play a factor in the indexing of the Episode nodes and the ordering of the results.
  • Search results should include the program and episode title, a brief description, a link to the episode page on WGBH.org, and a link to the program web site, if there is one.
  • In ordering the search results, keyword relevance is the most important factor, but upcoming airings are a close second. For example, a search for “Curious George” would yield a long list of episodes for that program, but the episode that is airing this afternoon would be at the top of the list, followed by the episode airing tomorrow morning, and so on.

The real heavy lifting of Drupal’s search mechanism can be broken down into two areas: the indexing of the nodes (hook_update_index()) and the search query (hook_search()), both of which involve some code that quickly made my head hurt. But as luck would have it, I pulled out our copy of Pro Drupal Development and discovered a whole chapter dedicated to search. That, combined with Robert Douglass’ very detailed blog post, Drupal Search: How indexing works, worked wonders like a big bottle of ibuprofen.

Indexing

When cron runs, Drupal will index any new nodes, and reindex nodes that have changed since the last run. The title and body of a node, with all HTML tags intact, are parsed — Drupal uses the HTML tags to give additional weight to words. Text in an H1 tag must be important, so those words would carry a very high score, while linked text would carry a lower score (although higher than plain text). Words that are bolded, italicized, or underlined also get a small boost.

This is why a node with “Nova” in the title scores higher than a node with “bossa-nova” in the description, when the search term is “Nova”.
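
To make the tag weighting concrete, here is a minimal sketch of the idea in Python. It is not Drupal's code: the weights in TAG_WEIGHTS and the function name score_words() are assumptions chosen for illustration, and the tokenizing is far cruder than what the real indexer does.

```python
import re
from collections import defaultdict

# Illustrative weights only (assumed for this sketch, not read from Drupal):
# headings score highest, links next, emphasis tags get a small boost.
TAG_WEIGHTS = {"h1": 25, "a": 10, "b": 3, "strong": 3, "i": 3, "em": 3, "u": 3}

# Matches either a tag (capturing "/" and the tag name) or a run of plain text.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z0-9]+)[^>]*>|([^<]+)")
WORD_RE = re.compile(r"[a-z0-9]+")


def score_words(html):
    """Return {word: score}; each occurrence scores by its heaviest enclosing tag."""
    scores = defaultdict(int)
    open_tags = []  # weighted tags that are currently open
    for closing, tag, text in TOKEN_RE.findall(html):
        if tag:
            tag = tag.lower()
            if tag not in TAG_WEIGHTS:
                continue  # unweighted tags such as <p> do not affect scoring
            if not closing:
                open_tags.append(tag)
            elif tag in open_tags:
                # Close the most recently opened tag of this name.
                last = len(open_tags) - 1 - open_tags[::-1].index(tag)
                del open_tags[last]
        else:
            # Plain text counts 1 per word; weighted tags raise that.
            weight = max([1] + [TAG_WEIGHTS[t] for t in open_tags])
            for word in WORD_RE.findall(text.lower()):
                scores[word] += weight
    return dict(scores)


title_hit = score_words("<h1>Nova</h1><p>A science series.</p>")
desc_hit = score_words("<h1>Latin Jazz</h1><p>An evening of bossa-nova classics.</p>")
print(title_hit["nova"], desc_hit["nova"])  # prints: 25 1
```

With these made-up weights, the word in the H1 title outscores the same word buried in a description by a wide margin, which is the effect described above.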

Overriding the Index

For our purposes, when we index an Episode node, we also want to include the title and description of the parent Program in that index. It is entirely possible that an episode of Nova, for example, might not even mention the word “Nova” in the title or description, so we must include the Program title and description.

To achieve this we use hook_update_index() to loop through any new Episodes. We load both the Episode node and the parent Program node, and then build a string with both the Program and Episode titles in H1 tags, and append the body of each node with all HTML tags intact. That string is then passed off to search_index() where each term is counted, scored, and added to the index.
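
The steps above can be sketched roughly as follows. This is a Python illustration of the approach, not the module's actual code: the Node class, the program_nid field, and the index_fn callback (standing in for search_index()) are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node."""
    nid: int
    type: str
    title: str
    body: str
    program_nid: Optional[int] = None  # assumed link from Episode to Program


def build_episode_index_text(episode: Node, program: Node) -> str:
    """Put both titles in H1 tags, then append both bodies with HTML intact."""
    return "".join([
        f"<h1>{escape(program.title)}</h1>",
        f"<h1>{escape(episode.title)}</h1>",
        program.body,
        episode.body,
    ])


def index_new_episodes(
    nodes: Dict[int, Node],
    pending_nids: Iterable[int],
    index_fn: Callable[[int, str], None],
) -> None:
    """Loop over Episodes awaiting indexing and hand each combined string off."""
    for nid in pending_nids:
        episode = nodes[nid]
        if episode.type != "episode" or episode.program_nid is None:
            continue  # only Episode nodes with a parent Program are indexed here
        program = nodes[episode.program_nid]
        index_fn(nid, build_episode_index_text(episode, program))
```

Because the Program title lands in an H1 tag alongside the Episode title, a search for the program name matches every one of its episodes with a high score, even when the episode text never mentions it.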

Search Query: Ordering the Results

As the requirements specified, the results of the search query should be weighted with keyword relevance and upcoming airing date being the primary factors in determining the order.

Keyword relevance, of course, is a standard part of the Drupal search ranking mechanism, but to affect the score based on upcoming airings, we construct an additional ranking query. That query, which returns the difference of the upcoming airing timestamp and the end of the data window (or 1 if there are no upcoming airings), is passed to Drupal’s do_search() function. An array of node IDs is returned and passed off to the theme level.
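
Here is one way the airing factor could be folded into the ordering, sketched in Python. Several things are assumptions made for illustration: that the difference is taken as window end minus airing time (so sooner airings score higher), that relevance is normalized to the range 0 to 1, and that the two factors are combined as a weighted sum with the weights shown. The real query runs in SQL inside do_search(); the function names here are hypothetical.

```python
def airing_score(next_airing, window_end):
    """Window end minus the next airing timestamp, or 1 with no upcoming airing."""
    if next_airing is None:
        return 1
    return max(window_end - next_airing, 1)


def rank_episodes(candidates, now, window_end,
                  relevance_weight=5.0, airing_weight=3.0):
    """Order node IDs by a weighted sum of relevance and airing proximity.

    candidates: iterable of (nid, relevance, next_airing) tuples, where
    relevance is assumed to be in [0, 1] and next_airing is a Unix
    timestamp or None. The weights are illustrative assumptions.
    """
    window = max(window_end - now, 1)

    def combined(candidate):
        _nid, relevance, next_airing = candidate
        proximity = airing_score(next_airing, window_end) / window
        return relevance_weight * relevance + airing_weight * proximity

    ranked = sorted(candidates, key=combined, reverse=True)
    return [nid for nid, _relevance, _airing in ranked]


now = 1_000_000
window_end = now + 14 * 86400  # assumed two-week data window
results = rank_episodes(
    [
        (103, 0.9, None),             # relevant, no upcoming airing
        (102, 0.9, now + 86400),      # relevant, airs tomorrow
        (104, 0.2, now + 3600),       # weak match, airs in an hour
        (101, 0.9, now + 4 * 3600),   # relevant, airs this afternoon
    ],
    now,
    window_end,
)
print(results)  # prints: [101, 102, 103, 104]
```

Among equally relevant episodes, the one airing soonest comes first, while a weak keyword match does not jump ahead just because it airs soon.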

One very nice thing about Drupal’s search is that this custom search was developed without impacting the existing full site search capability. No core code needed to be touched, and in the future we can add scoped search to other areas (like Radio) by replicating several functions and adding a few case statements.

It’s Go Time!

After weeks and months of working on something, you can easily forget that, at some point, it actually has to get done! Well, I’m not able to say that our new TV Programs and Schedules module is completed yet, but we have reached a big milestone: principal development is done!

What does that mean? Well, it means that Pete and I have coded up everything that we know about to the specifications provided and now it’s ready to be fully tested. You could basically call it an alpha release of the front and back end code.

Exciting? That’s one word for it.

The last piece of the puzzle was a custom search module that Pete has coded up, based on the core Drupal search module. Basically, it’s search scoped to our TV programs and episodes (that’s all the content we have in Drupal just now anyways). Down the line, as we rebuild the whole rest of WGBH.org, we also envision a scoped search for each major subsite (e.g. one for radio, one for web-only content, etc.), plus some sort of global search across everything.

We’ll write up more about how search was implemented later, but for now, here’s a sneak peek:

[Image: TV Programs Search]

Our goal in the next few days is to get the code base installed on our test/staging servers, document how things work and then let our content producers and functional testers have at it!

Obviously, I don’t anticipate there being any bugs, issues or blemishes of any sort. But we’ll go through the charade of testing anyways, just to make everybody feel good!