Providing Site-wide Full-Text Search with Oracle interMedia (ArsDigita Systems Journal)

Providing Site-wide Full-Text Search with Oracle interMedia

by Michael Bryzek (mbryzek@arsdigita.com)

Submitted on: 2000-07-01
Last updated: 2000-09-08

ArsDigita : ArsDigita Systems Journal : One article

Oracle interMedia provides full-text indexing of data in an Oracle database. For an Oracle-backed website, interMedia allows users to search the content of the site by keywords, a key feature for most web services. For programmers, interMedia provides a tightly coupled SQL interface for writing search queries.
The current Site-wide Search module in the Arsdigita Community System (ACS) uses interMedia to index text in an arbitrary number of tables. This module makes it easy to index additional data in any tables in the data model, demonstrating the power of interMedia.
Unfortunately, if interMedia is not configured correctly from the start, it is a nightmare to administer. This article presents an overview of our implementation of site-wide full-text search with Oracle interMedia for http://www.guidestar.org and http://www.volunteersolutions.org, based on our experience with what does and does not work in an interMedia installation. In particular, we focus on maintaining the interMedia indexes so that they are always up-to-date and never break. We do not address setup/configuration details which are fairly well-documented (see http://technet.Oracle.com/products/intermedia/htdocs/text_techover.html). In most cases, installation is not difficult and requires you to source a few files in the $ORACLE_HOME/ctx directory and to configure Net8.
Alternatives to Oracle interMedia
Full-text search products can be placed into two broad categories:

Stand-alone software primarily used to search static Web sites or flat files. One example is PLS (http://www.pls.com), which requires database users to export their data to the file system, and then to index them using PLS software.
Existing Web sites that offer their search technology as a service for other web sites. One example is AltaVista Search (http://solutions.altavista.com/) which provides out-of-the box flat file searching and a developer's kit to integrate with a database or other source of data.
When using the Oracle RDBMS, interMedia provides the following advantages:

Faster synchronization of changing table data with the full-text index. Because interMedia is part of Oracle, we can provide immediate, or very rapid, synchronization.
Tight SQL integration, keeping the query format similar in structure to SQL. This integration allows new users to learn the query language more easily.
Potential for better scoring by indexing only the text we need. We can easily "tell" interMedia to index only the data in which we are interested, ignoring much of the noise apparent in text found in web pages (e.g. HTML used for layout).
Any application that talks to Oracle can take advantage of the existing interMedia indexes, making it easy for external applications to execute the same queries as the production web site.

Overview
We implemented site-wide full-text search on both GuideStar.org, a site that lists financial information on over 700,000 American non-profit organizations, and VolunteerSolutions.org, a much smaller site that provides volunteers with information about local opportunities. Both of the sites frequently update the information associated with each record, requiring us to constantly update the indexes used for full-text searching. Most of the code samples we present in this article are taken from the Volunteer Solutions site as the logic is simpler and the code is easier to understand.
In both sites, the data we wanted to index was spread out among a few tables and columns. The implementation of full-text search consists of the following major steps:

Synchronize the data from all the sources into one central table (the process of denormalizing the data to index)
Synchronize the interMedia index or indexes with the denormalized data

Step One: Synchronizing the data to index
We need to synchronize the data we want to index as we have multiple sources (e.g. the data that we want to index is dispersed through several tables). The first step is to create a central, site-wide search table. We need to include at least one column containing all the text we want to index and we should include extra information that we will use when presenting results to the user. Placing this information in the same table is important for the following reasons:

interMedia will compute statistics about all the text we are indexing. By using one table to store everything, we are able to compare results from various queries. If we stored the data in separate tables, we would have no statistics or information by which to make a solid comparison.
Searching one table will be faster than searching through two joined tables. Though performance is not always a concern, it is a nice side-effect of maintaining a site-wide search table.
-- VolunteerSolutions.org site-wide search table

create table search (
        -- the primary key points back to the detailed opportunity pages
	opp_id 			integer 
				constraint search_opp_id_pk primary key 
				constraint search_opp_id_fk references vs_opp,
        -- the agency that is sponsoring this volunteer opportunity
	agency_id		constraint search_agency_id_fk references vs_agency
				constraint search_agency_id_nn not null,
        -- basic user group information to save a join with user groups!
	group_id		constraint search_group_id_fk references user_groups,
	parent_group_id		constraint search_parent_group_id_fk references user_groups,
        -- different ways of looking at the sponsoring agency's name
	agency_name		varchar(150),
	agency_name_sort_key	varchar(150),
	agency_name_initial	char(1),
        -- where is this opportunity located?
	city			varchar(100),
	state			varchar(2),
	-- first 5 digits of the zip code
	zip			varchar(5),
        -- is this a one-time or ongoing opportunity
	one_time_p		char(1),
        -- latitude and longitude for zipcode/radius searching
	latitude		number(9,6),
	longitude		number(9,6),
        -- different ways of looking at the name of this volunteer opportunity
	headline		varchar(200),
	headline_sort_key	varchar(200),
	headline_initial	varchar(200),
        -- is this opportunity private to the specified group_id?
	private_p		char(1),
        -- foreing URL describing the opportunity (or agency) in more detail
	url			varchar(100),
        -- the text we want to index
	keywords 		clob,
        -- a description of the opportunity to show users
	description		clob,
        -- one clob for each category type (taken from the categories table)
        -- this lets us search for keywords within a category. Note that
        -- we could have created the same functionality using interMedia's
        -- Section Groups
	category_group		clob,
	category_interest	clob,
	category_skill		clob,
	category_time		clob,
	category_interaction	clob
);
Next, we need to collect the data we want to index from within the data model. On both GuideStar and Volunteer Solutions, we wanted to index text stored in approximately ten columns throughout the data model. We created PL/SQL procedures to combine the data we wanted to index from all the tables into the single column named keywords in the table search. The following code sample is the portion of the PL/SQL procedure from VolunteerSolutions.org that updates most of the columns in the search table.
   update search 
    set agency_id=v_agency_id,
        group_id=v_group_id,
        parent_group_id=v_parent_group_id,
        zip=v_zip,
        city=v_city,
        state=v_state,
        one_time_p=v_one_time_p,
        latitude=v_latitude,
        longitude=v_longitude,
        headline=v_headline,
        headline_sort_key=v_headline_sort_key, 
        headline_initial=v_headline_initial, 
        url=v_url,
        agency_name=v_agency_name, 
        agency_name_sort_key=v_agency_name_sort_key, 
        agency_name_initial=v_agency_name_initial, 
        private_p=v_private_p, 
        keywords=empty_clob(),
        description=empty_clob(),
        category_group=empty_clob(),
        category_interest=empty_clob(),
        category_skill=empty_clob(),
        category_time=empty_clob(),
        category_interaction=empty_clob()
    where opp_id=v_opp_id;

   -- now update all the lobs
   select keywords, description, category_group, category_interest, 
          category_skill, category_time, category_interaction
     into v_keywords, v_description, v_category_group, v_category_interest, 
          v_category_skill, v_category_time, v_category_interaction
     from search where opp_id=v_opp_id for update;

   -- pull out the description (also a lob)
   select description into v_description_temp from vs_opp where opp_id=v_opp_id;

   -- set up the new keywords
   dbms_lob.writeappend(v_keywords, length(v_buffer), v_buffer);
   dbms_lob.writeappend(v_keywords, 1, ' ');
   dbms_lob.append(v_keywords, v_description_temp);

   dbms_lob.append(v_description, v_description_temp);
Once the site-wide search table is populated, we need to ensure that future inserts, updates, and deletes to the underlying data are reflected in the information kept in the site-wide search table. There are two basic approaches to synchronizing the data stored in the site-wide search table:

Immediately update the data
Mark the row in the site-wide search table for later update

Approach 1: Immediately update the data
If the amount of work we need to do to update an individual row of data in the site-wide search table is small, immediately updating the data in the site-wide search table makes sense. To implement immediate updates, create row-level triggers on all the tables Oracle has to access when it updates the corresponding data in the search index. Whenever an underlying table is modified, a trigger will fire, immediately updating, or removing, the information in the site-wide search table. We use this approach in ACS to keep static-pages up-to-date:
create or replace trigger static_pages_sws_insert_tr
  after insert on static_pages for each row
  WHEN (NEW.index_p = 't')
BEGIN
  -- we have to create a new row in the index table for this row.
  insert into site_wide_index 
  (table_name, the_key, one_line_description, datastore)
  values 
  ('static_pages', :new.page_id, :new.page_title, 'a');
END;
One rule of thumb for deciding what constitutes a small amount of work is to count the number of rows in all tables that must be accessed to update the corresponding row in the site-wide search table. If you only need to access one or two rows, you should probably use this method.
Approach 2: Mark the row for later update
The second approach relies on triggers to mark a row in the site-wide search table as dirty. A scheduled process will later update the data in the row. This is advantageous as it allows for batch processing of dirty rows which may decrease the total time needed to update the rows. By updating the index in batches, we can improve the overall quality of the index, since interMedia will reduce index fragmentation by grouping similar words together during the index process.
There are a couple of ways to store the dirty bit:
Bad: Store a dirty_p column in the site-wide search table. Though this is a simple solution, it will force a write to the site-wide search table, decreasing overall performance. The key to maintaining top search performance is to ensure that any book-keeping we do is separate from the data associated with our site-wide-index. Since the dirty_p column is in the same tablespace and datafile as the rest of the site-wide search table, updating this column will force an unnecessary disk access to the same drive that holds your site-wide search table.
Good: Create a table to store a list of rows that need to be updated. One way to do this is to create the following table:
create table search_dirty_rows (
	on_which_table		varchar(50) not null,
	on_what_id		integer not null,
	constraint sdr_which_table_what_id_pk 
            primary key(on_which_table, on_what_id),
	creation_date		date default sysdate
);
Note that you may consider removing the primary key constraint to improve write performance to this table. We recommend the constraint as it leads to cleaner code to update the corresponding data in the search index as we do not neet to worry about updating the same data twice.
This solution also lends itself to a simple pipeline for keeping the data current:
Write triggers for each table on which the site-wide search table depends. Example:
create or replace trigger vs_agency_tr
before insert or update on vs_agency
for each row
DECLARE
  v_exists_p integer;
BEGIN

  select decode(count(*),0,0,1) into v_exists_p 
    from search_dirty_rows 
   where on_which_table='vs_agency' 
     and on_what_id=:new.agency_id;

  if v_exists_p = 0 then 
    BEGIN
      -- We have to catch the case where another row is created
      -- between the time we've counted and are going to insert
      -- Note that the exception means the row already exists
      -- so we don't have to do anything if we catch the exception
      insert into search_dirty_rows
      (on_which_table, on_what_id)
      values
      ('vs_agency', :new.agency_id);
    EXCEPTION WHEN OTHERS THEN NULL;
    END;
  end if;
END;
Write one procedure that knows how to update the site-wide search table for every table that has a trigger on it (this is the same procedure used to initially populate the site-wide search table).
Regularly update the dirty rows.
The method using search_dirty_rows leads to code that is easier to understand and simpler to maintain. It is also easy to add data from new tables to the site-wide search table, since the procedures to keep the data up-to-date are independent.
Step Two: Synchronize the interMedia indexes
Now that we have created a site-wide search table, we need to tell interMedia what text to index. interMedia's datastore class is responsible for translating any data into text that can be indexed (the data itself can be stored as a blob, varchar2, or clob with clob being the default). The datastore class is made up of the following objects:

Direct datastore - directly uses the text from one column in a table.
User datastore - invokes a procedure for each row that you are indexing. This procedure returns the text to index based on a rowid.
Detail datastore - combines multiple rows into the text to index.
URL datastore - fetches the URL stored in the column and indexes the returned text.
File datastore - indexes the contents of the specified file.
The direct and user datastore objects are the most useful for site-wide full-text searching. Using the direct datastore object may require additional storage space as we store a duplicate copy of the text to index in a column of our site-wide search table. The primary advantage of the direct datastore is that we can simply query the table to see what text is being indexed. With the user datastore object, we have to execute the specified procedure to see what text is being indexed.
The example of the site-wide search table presented in this article uses the direct datastore object by directly storing the text to index in the keywords column of the table search. To index the text in this column, we execute:
    SQL> create index search_keywords_im_idx 
         on search(keywords) 
         indextype is ctxsys.context;
This command creates an interMedia index on search(keywords). Note that the "im" in the name of the index simply lets us quickly see that this is an interMedia index. The amount of time it takes to create the index depends directly on the hardware you are using and the amount of data that you are indexing. On GuideStar, it takes about 2 1/2 hours to index approximately 1gb of data on a Sun E450 with 4 processors and 4gb of RAM. Note that in order to create interMedia indexes, the oracle user must have been granted the ctxapp role.
Most of the time, interMedia does a good job of creating indexes. If you get errors during the build process, either you have not properly installed interMedia or your index needs to be dropped. Most of the errors we have seen have come from Net8 configuration errors (see http://oradoc.photo.net/ora816/network.816/a76933/preface.htm).
Performance with interMedia indexes varies widely, primarily depending on how common the query is. For example, a rarely used query such as "guidestar" will return quickly whereas a common word such as "philanthropy" will take significantly longer:
SQL> select count(*) 
       from search 
      where contains(keywords, 'guidestar', 1) > 0
      order by score(1);
  2    3    4  
  COUNT(*)
----------
	 5

Elapsed: 00:00:00.02

SQL> select count(*) 
       from search 
      where contains(keywords, 'philanthropy', 1) > 0
      order by score(1);
  2    3    4  
  COUNT(*)
----------
     32517

Elapsed: 00:00:01.99
Once we have built the interMedia index, we need to keep it synchronized with the data in the site-wide search table. Unlike Oracle B-tree indexes, interMedia indexes are not updated during the execution of SQL transactions. There are two ways to update the interMedia indexes:

Run Oracle Context Server (ctxsrv)
Manually alter the indexes

Method One: Oracle Context Server
Oracle Context Server immediately updates an interMedia index after any change to the underlying data. This process frees the user from having to think about updating the interMedia indexes as the index updates are transparent. To start context server, execute the following from the Unix shell:
 
> ctxsrv -user ctxsys/ctxsyspwd -personality M -log $ORACLE_HOME/ctx/ctxsys.log &
where ctxsyspwd is the password for the Oracle user ctxsys.
There are two major problems with Oracle Context Server:

Immediately updating intermedia indexes leads to greater fragmentation and thus poorer performance.
Oracle Context Server consistently crashed every few days when we ran it on GuideStar.

Method Two: Manually update the index
The second method is to manually update the index on a regular basis. The simplest way to update an interMedia index is to open a SQL*Plus session as the user who created the index. Then:
SQL> alter index search_keywords_im_idx rebuild online
     parameters('sync memory 45M');
        
Index altered.
The parameters of this call:
rebuild online: Rebuild the index incrementally so that it is still accessible. If you forget to specify "online," users searching the Web site would not be able to access the index.
sync: Just update the index (rather than rebuild it entirely)
memory 45M: Specify the amount of memory, in megabytes, that interMedia can use while rebuilding the index. Using more memory decreases the amount of disk I/O and thus increases the speed with which the index can be rebuilt. The default is 12 megabytes and the acceptable range of inputs for memory is 1 KB to about 50 MB. To increase the maximum allowable index memory, execute (as the ctxsys user):
 SQL> exec ctx_adm.set_parameter('max_index_memory', '1G');
More information is available in the http://oradoc.photo.net/ora816/inter.816/a77063/cdatad11.htm
"Manually" updating the index is the preferred method. The best way to do this is to run a script every hour (or however often you want to update your index) that executes the appropriate ALTER INDEX statements. The script should run from a robust environment that is relatively safe from interruptions such as a Unix cron job or as an Oracle DBMS job.
We maintain the interMedia indexes on GuideStar and Volunteer Solutions with ad-context-server.pl, a small perl script that we run as a cron job (see http://software.arsdigita.com/bin/ad-context-server.pl). This script uses Perl DBI and DBD-Oracle to connect as the ctxsys user, reads a list of all indexes that need to be updated (SQL> select PND_INDEX_NAME, count(*) from ctx_user_pending group by PND_INDEX_NAME;), and executes the alter index statements. Using this script, we have run both GuideStar and Volunteer Solutions with no hitches for over half a year now.
As of Oracle 8.1.6, another safe way to update the index is to submit a DBMS_JOB to issue the update statement. The webmail module in the ACS uses this method:
-- Resync the interMedia index every hour.

DECLARE
  v_job number;
BEGIN 
  dbms_job.submit(v_job, 
                 'ctx_ddl.sync_index(''search_keywords_im_idx'');', 
                 interval => 'sysdate + 1/24');
END;
Updating the index from within an environment that is constantly restarted, e.g. from within a web server, is a bad idea. On GuideStar, we initially created a simple procedure that ran every hour from within the web server and simply executed the alter index statement. One night we decided to explicitly update the index:
 
   SQL> alter index search_keywords_im_idx rebuild online 
        parameters('sync memory 45M');
Two hours later, all keyword searches on the Web site were failing. Users were getting back a page with zero results, regardless of their query. We exectued another alter index statement and got back an error saying that the index was marked as FAILED/LOADING. Had we been smart, we would have logged what was going on by executing, as the ctxsys user:
   SQL> exec ctx_output.start_log('better_days');           
to log the progress of the alter index statement. This command would have logged all errors and notices to $ORACLE_HOME/ctx/log/better_days.
You can also look at interMedia errors to find out what went wrong by opening a SQL*Plus session as the ctxsys user and running select err_timestamp, err_index_name, err_text from ctx_index_errors order by err_timestamp; to get a list of all the errors on interMedia indexes, sorted by date.
The resume parameter to the alter index statement is designed to finish updating an index:
 
   SQL> alter index search_keywords_im_idx rebuild online 
        parameters('resume memory 45M');
The resume statement has never worked for us but maybe you will have better luck. Probably, you will just see the same error message you had before. Somehow the index has become corrupted:
 
   SQL> drop index search_keywords_im_idx force;
   SQL> create index search_keywords_im_idx on search(keywords) 
        indextype is ctxsys.context;
And 2 1/2 hours later, the index was re-created and back on-line. Meanwhile, our Web site was not serving search results for over three hours.
This error occurred because the hourly procedure running inside our web server to update the interMedia index started up in parallel to the one we manually executed in SQL*Plus. These two threads competed for some resource, and ended up in a deadlock. We have been able to replicate this same error in two other ways:

We had a scheduled procedure running in AOLserver updating the index. We restarted the web server, and ended up with two concurrent alter index statements running.
We were running Oracle context server, and executed an alter index statement from SQL*Plus.
Note that an uncommitted session in SQL*Plus can keep the alter index statement from completing. Also, the probability of deadlock is directly related to the amount of time it takes for your index to be updated. If you update your index once an hour, and the process only takes a couple minutes to finish, it is unlikely that you will ever see a deadlock. On GuideStar, some index updates would take a few hours making it easy to end up with a deadlock.
When manually altering interMedia indexes, you have to make sure that no more than one alter index statement is running at any given time. More than one statement could result in a deadlock inside interMedia that can only be solved by dropping and recreating the index.
The people at interMedia tell us that the deadlock should not happen, and we agree! However, the reality of the deadlock gives us the following interMedia law: Use exactly one method to update interMedia indexes.
Known bugs
Oracle 8.1.5 has a bug in the parser for the about clause. The bug seems to incorrectly map some multi word queries to their corresponding tokens in the interMedia tables. For example, about('massachusetts bay'), which is tokenized as 'Massachusetts Bay', is parsed as 'Massachusetts bay', not matching any row in the interMedia tokens table and thereby throwing an error. If you find yourself in this situation, you can remove the about clauses in your queries.
We had many problems running interMedia in Oracle 8.1.5 and 8.1.5.1. On GuideStar, we had to restart the Oracle database every 1-2 nights just to maintain an adequate level of performance. We never managed to fully trace down the problem, but we were getting a lot of loose processes that simply bogged down the database.We were also constantly (twice a day at one point!) seeing ORA-600 errors:
00600, 00000, "internal error code, arguments: [%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]"
// *Cause:  This is the generic internal error number for Oracle program
//         exceptions.  This indicates that a process has encountered an
//         exceptional condition.
// *Action: Report as a bug - the first argument is the internal error number
All of these problems went away when we upgraded to Oracle 8.1.6. If you are going to use interMedia, we do not recommend using any release prior to Oracle 8.1.6. Oracle has also released patches for 8.1.6 to address a few initial problems, such as theme-based queries. These patches should be applied before using interMedia.
General Tuning Considerations
On the web, we are often interested in getting back the first few rows that match a query as fast as possible (as opposed to generating a list of all the matches). As with other Oracle SQL queries, interMedia makes use of the FIRST_ROWS hint to optimize the query to return the first rows in the shortest amount of time. See http://technet.Oracle.com/products/intermedia/htdocs/text_techover.html#query_tuning for more information about query optimization.
As you continue to update your interMedia indexes, the tables interMedia maintains become more and more fragmented. You may want to optimize your index by executing the following statement in SQL*Plus (after removing the ad-context-server.pl from your crontab - no sense risking a deadlock!):
 
    SQL> alter index search_keywords_im_idx rebuild online 
         parameters('optimize fast');
Optimization is an extremely slow process and it is not clear that the performance gain of an optimized index will be noticeable. If you are going to optimize, you should limit the amount of time that the optimization will take to, for example, 180 minutes:
 
    SQL> alter index search_keywords_im_idx rebuild online 
         parameters('optimize full maxtime 180');
The interMedia documentation (http://oradoc.photo.net/ora816/inter.816/a77061/ind10.htm) has more information about optimizing indexes.
Another consideration is the layout of the data files used to store the site-wide search table and the interMedia indexes. Ideally, you will want to put your denormalized site-wide search table in its own tablespace on its own mirrored drive and use another tablespace on another mirrored drive to store the indexes. Separating the indexes and the data will minimize I/O contention leading to better performance.
Text parsing - the interMedia query language
To use interMedia full-text search, you must learn interMedia's query language. Here we present the most commonly used functions in terms of full-text search on the web.
The basic syntax is to add contains and score clauses to your SQL query. An example is the simplest way to learn this syntax:
SQL> select org_name, score(1) as the_score
       from search 
      where contains(keywords, 'guidestar', 1) > 0
      order by score(1) desc;

ORG_NAME					    THE_SCORE
-------------------------------------------------- ----------
Philanthropic Research, Inc.				  100
Thankful Hearts Food Pantry, Inc.			   17
Decatur County Education Foundation, Inc.		   17
Wadessa Project Hope					   17
Safe Haven Project, Inc.				   17
In this example, we are looking for any row that matches the keyword "guidestar" using the interMedia index on the keywords column of the table search. The third argument to the contains function, 1, is simply a placeholder that matches the argument to the score function. You can specify multiple keywords in the query string by separating each keyword with a comma. You can also specify the relative weights of each of the keywords to affect how the overall score is computed:
SQL> select org_name, score(1) as the_score
       from search 
      where contains(keywords, 'guidestar*3,apotheosis', 1) > 0
      order by score(1) desc;

ORG_NAME					    THE_SCORE
-------------------------------------------------- ----------
Philanthropic Research, Inc.				   58
Thankful Hearts Food Pantry, Inc.			   54
Decatur County Education Foundation, Inc.		   54
Wadessa Project Hope					   54
Safe Haven Project, Inc.				   54
A Temple of the Apotheosis, Inc.			   25
Indianapolis Art Center, Inc.				    5
The interMedia query language support three boolean operators: AND, OR, and NOT (abbreviated to &, |, and ~ respectively). Note that the NOT operator is equivalent to AND NOT.
There are a few other useful interMedia operators worth mentioning:
$ - This stemming operator matches all words similar to the one specified. For example, the clause contains(keywords,'$child',1) > 0 matches all rows containing the word child or any related words such as children:
SQL> select count(*) 
       from search 
      where contains(keywords,'child',1) > 0;

  COUNT(*)
----------
       198

SQL> select count(*) 
       from search 
      where contains(keywords,'$child',1) > 0;

  COUNT(*)
----------
       590

SQL> select count(*) 
       from search 
      where contains(keywords,'$children',1) > 0;

  COUNT(*)
----------
       590
near - This operator looks for the proximity of the specified words to each other. The closer the words appear to each other, the higher the score. For example, if we wanted to find all rows matching youth soccer:
SQL> select count(*) 
       from search 
      where contains(keywords,'youth,soccer',1) > 0;

  COUNT(*)
----------
	86

SQL> select count(*) 
       from search 
      where contains(keywords,'youth NEAR soccer',1) > 0;

  COUNT(*)
----------
	 3
about - This operator takes advantage of the theme of each document to expand the number of results returned to the user. For example, using the about operator on the same query as above returns an additional eight rows:
SQL> select count(*) 
      from search 
     where contains(keywords,'about(youth,soccer)',1) > 0;

  COUNT(*)
----------
	94
Some of the folks at interMedia wrote a PL/SQL function to translate what humans write to an interMedia query string consisting of the NEAR and ABOUT operators (see the im_convert function in http://software.arsdigita.com/www/doc/sql/site-wide-search.sql). This function is fairly rough, but in many cases will suffice.
Summary
The overall process for implementing site-wide full-text search is as follows:

Create and maintain one site-wide search table of all the data to index

Create triggers on the underlying tables that simply mark the corresponding row in the site-wide search table as dirty
Regularly update the dirty rows in the site-wide search table

Create the interMedia indexes you want to use
Run ad-context-server.pl every hour as a Unix cron job or submit an Oracle job to keep the interMedia indexes up-to-date. Do not use any other method to update the index.

Future work
Currently, the most painful flaw in our use of interMedia is the amount of work it takes to correctly translate a user's input to interMedia jargon. We need to create a suite of text parsers that can be used for different situations. The site-wide search module in ACS continues to improve on the functions used to translate user-entered text into interMedia query syntax, but there is still a lot of room for improvement.

asj-editors@arsdigita.com

Reader's Comments

In these happy days of Oracle 8.1.6.2 (http://www.photo.net/scratch/ctx_patch/), it is preferable to use a dbms_job to sync the Intermedia index.

-- David Eison, January 28, 2001

Great document about interMedia presenting an introduction to interMedia and outlining many of its features.

-- Michael Bryzek, January 31, 2001

Providing Site-wide Full-Text Search with Oracle interMedia

Alternatives to Oracle interMedia

Overview

Step One: Synchronizing the data to index

Approach 1: Immediately update the data

Approach 2: Mark the row for later update

Step Two: Synchronize the interMedia indexes

Method One: Oracle Context Server

Method Two: Manually update the index

Known bugs

General Tuning Considerations

Text parsing - the interMedia query language

Summary

Future work

Reader's Comments

Related Links