by Michael Bryzek (mbryzek@arsdigita.com)
Submitted on: 2000-07-01
ArsDigita : ArsDigita Systems Journal : One article
The current Site-wide Search module in the Arsdigita
Community System (ACS) uses interMedia to index text in an arbitrary
number of tables. This module makes it easy to index additional data
in any tables in the data model, demonstrating the power of
interMedia.
Unfortunately, if interMedia is not configured correctly from the
start, it is a nightmare to administer. This article presents an
overview of our implementation of site-wide full-text search with
Oracle interMedia for http://www.guidestar.org and http://www.volunteersolutions.org,
based on our experience with what does and does not work in an
interMedia installation. In particular, we focus on maintaining the
interMedia indexes so that they are always up-to-date and never
break. We do not address setup/configuration details which are fairly
well-documented (see http://technet.Oracle.com/products/intermedia/htdocs/text_techover.html). In
most cases, installation is not difficult and requires you to source a
few files in the $ORACLE_HOME/ctx directory and to configure Net8.
In both sites, the data we wanted to index was spread out among a few
tables and columns. The implementation of full-text search consists
of the following major steps:
One rule of thumb for deciding what constitutes a small amount of work
is to count the number of rows in all tables that must be accessed to
update the corresponding row in the site-wide search table. If you
only need to access one or two rows, you should probably use this
method.
There are a couple of ways to store the dirty bit:
This solution also lends itself to a simple pipeline for keeping
the data current:
The example of the site-wide search table presented in this article
uses the direct datastore object by directly storing the text to index
in the
Most of the time, interMedia does a good job of creating
indexes. If you get errors during the build process, either you have
not properly installed interMedia or your index needs to be
dropped. Most of the errors we have seen have come from Net8
configuration errors (see http://oradoc.photo.net/ora816/network.816/a76933/preface.htm).
Performance with interMedia indexes varies widely, primarily depending
on how common the query is. For example, a rarely used query such as
"guidestar" will return quickly whereas a common word such as
"philanthropy" will take significantly longer:
There are two major problems with Oracle Context Server:
We maintain the interMedia indexes on GuideStar and Volunteer
Solutions with ad-context-server.pl, a small perl script that we run
as a
As of Oracle 8.1.6, another safe way to update the index is to submit
a
Updating the index from within an environment that is constantly
restarted, e.g. from within a web server, is a bad idea. On
GuideStar, we initially created a simple procedure that ran every hour
from within the web server and simply executed the alter index
statement. One night we decided to explicitly update the index:
You can also look at interMedia errors to find out what went wrong by
opening a SQL*Plus session as the
The
This error occurred because the hourly procedure running inside our
web server to update the interMedia index started up in parallel to
the one we manually executed in SQL*Plus. These two threads competed
for some resource, and ended up in a deadlock. We have been able to
replicate this same error in two other ways:
When manually altering interMedia indexes, you have to make sure that
no more than one
The people at interMedia tell us that the deadlock should not happen,
and we agree! However, the reality of the deadlock gives us the
following interMedia law: Use exactly one method to update interMedia
indexes.
We had many problems running interMedia in Oracle 8.1.5 and
8.1.5.1. On GuideStar, we had to restart the Oracle database every 1-2
nights just to maintain an adequate level of performance. We never
managed to fully trace down the problem, but we were getting a lot of
loose processes that simply bogged down the database.We were also
constantly (twice a day at one point!) seeing ORA-600 errors:
As you continue to update your interMedia indexes, the tables
interMedia maintains become more and more fragmented. You may
want to optimize your index by executing the following statement in
SQL*Plus (after removing the ad-context-server.pl from your
Another consideration is the layout of the data files used to store
the site-wide search table and the interMedia indexes. Ideally, you
will want to put your denormalized site-wide search table in its own
tablespace on its own mirrored drive and use another tablespace on
another mirrored drive to store the indexes. Separating the indexes
and the data will minimize I/O contention leading to better
performance.
The basic syntax is to add
The interMedia query language support three boolean operators:
There are a few other useful interMedia operators worth mentioning:
Some of the folks at interMedia wrote a PL/SQL function to translate
what humans write to an interMedia query string consisting of the
Last updated: 2000-09-08
Oracle interMedia provides full-text indexing of data in an Oracle
database. For an Oracle-backed website, interMedia allows users to
search the content of the site by keywords, a key feature for most web
services. For programmers, interMedia provides a tightly coupled SQL
interface for writing search queries.
Alternatives to Oracle interMedia
Full-text search products can be placed into two broad categories:
When using the Oracle RDBMS, interMedia provides the following advantages:
Overview
We implemented site-wide full-text search on both GuideStar.org, a site that lists
financial information on over 700,000 American non-profit
organizations, and VolunteerSolutions.org, a
much smaller site that provides volunteers with information about local
opportunities. Both of the sites frequently update the information
associated with each record, requiring us to constantly update the
indexes used for full-text searching. Most of the code samples we
present in this article are taken from the Volunteer Solutions site as
the logic is simpler and the code is easier to understand.
Step One: Synchronizing the data to index
We need to synchronize the data we want to index as we have multiple
sources (e.g. the data that we want to index is dispersed through
several tables). The first step is to create a central, site-wide
search table. We need to include at least one column containing all
the text we want to index and we should include extra information that
we will use when presenting results to the user. Placing this
information in the same table is important for the following reasons:
-- VolunteerSolutions.org site-wide search table
create table search (
-- the primary key points back to the detailed opportunity pages
opp_id integer
constraint search_opp_id_pk primary key
constraint search_opp_id_fk references vs_opp,
-- the agency that is sponsoring this volunteer opportunity
agency_id constraint search_agency_id_fk references vs_agency
constraint search_agency_id_nn not null,
-- basic user group information to save a join with user groups!
group_id constraint search_group_id_fk references user_groups,
parent_group_id constraint search_parent_group_id_fk references user_groups,
-- different ways of looking at the sponsoring agency's name
agency_name varchar(150),
agency_name_sort_key varchar(150),
agency_name_initial char(1),
-- where is this opportunity located?
city varchar(100),
state varchar(2),
-- first 5 digits of the zip code
zip varchar(5),
-- is this a one-time or ongoing opportunity
one_time_p char(1),
-- latitude and longitude for zipcode/radius searching
latitude number(9,6),
longitude number(9,6),
-- different ways of looking at the name of this volunteer opportunity
headline varchar(200),
headline_sort_key varchar(200),
headline_initial varchar(200),
-- is this opportunity private to the specified group_id?
private_p char(1),
-- foreing URL describing the opportunity (or agency) in more detail
url varchar(100),
-- the text we want to index
keywords clob,
-- a description of the opportunity to show users
description clob,
-- one clob for each category type (taken from the categories table)
-- this lets us search for keywords within a category. Note that
-- we could have created the same functionality using interMedia's
-- Section Groups
category_group clob,
category_interest clob,
category_skill clob,
category_time clob,
category_interaction clob
);
Next, we need to collect the data we want to index from within the
data model. On both GuideStar and Volunteer Solutions, we wanted to
index text stored in approximately ten columns throughout the data
model. We created PL/SQL procedures to combine the data we wanted to
index from all the tables into the single column named
keywords in the table search. The following
code sample is the portion of the PL/SQL procedure from
VolunteerSolutions.org that updates most of the columns in the
search table.
update search
set agency_id=v_agency_id,
group_id=v_group_id,
parent_group_id=v_parent_group_id,
zip=v_zip,
city=v_city,
state=v_state,
one_time_p=v_one_time_p,
latitude=v_latitude,
longitude=v_longitude,
headline=v_headline,
headline_sort_key=v_headline_sort_key,
headline_initial=v_headline_initial,
url=v_url,
agency_name=v_agency_name,
agency_name_sort_key=v_agency_name_sort_key,
agency_name_initial=v_agency_name_initial,
private_p=v_private_p,
keywords=empty_clob(),
description=empty_clob(),
category_group=empty_clob(),
category_interest=empty_clob(),
category_skill=empty_clob(),
category_time=empty_clob(),
category_interaction=empty_clob()
where opp_id=v_opp_id;
-- now update all the lobs
select keywords, description, category_group, category_interest,
category_skill, category_time, category_interaction
into v_keywords, v_description, v_category_group, v_category_interest,
v_category_skill, v_category_time, v_category_interaction
from search where opp_id=v_opp_id for update;
-- pull out the description (also a lob)
select description into v_description_temp from vs_opp where opp_id=v_opp_id;
-- set up the new keywords
dbms_lob.writeappend(v_keywords, length(v_buffer), v_buffer);
dbms_lob.writeappend(v_keywords, 1, ' ');
dbms_lob.append(v_keywords, v_description_temp);
dbms_lob.append(v_description, v_description_temp);
Once the site-wide search table is populated, we need to ensure that
future inserts, updates, and deletes to the underlying data are
reflected in the information kept in the site-wide search table. There
are two basic approaches to synchronizing the data stored in the
site-wide search table:
Approach 1: Immediately update the data
If the amount of work we need to do to update an individual row of
data in the site-wide search table is small, immediately updating the
data in the site-wide search table makes sense. To implement immediate
updates, create row-level triggers on all the tables Oracle has to
access when it updates the corresponding data in the search
index. Whenever an underlying table is modified, a trigger will fire,
immediately updating, or removing, the information in the site-wide
search table. We use this approach in ACS to keep static-pages
up-to-date:
create or replace trigger static_pages_sws_insert_tr
after insert on static_pages for each row
WHEN (NEW.index_p = 't')
BEGIN
-- we have to create a new row in the index table for this row.
insert into site_wide_index
(table_name, the_key, one_line_description, datastore)
values
('static_pages', :new.page_id, :new.page_title, 'a');
END;
Approach 2: Mark the row for later update
The second approach relies on triggers to mark a row in the site-wide
search table as dirty. A scheduled process will later update the data
in the row. This is advantageous as it allows for batch processing of
dirty rows which may decrease the total time needed to update the
rows. By updating the index in batches, we can improve the overall
quality of the index, since interMedia will reduce index fragmentation
by grouping similar words together during the index process.
The method using dirty_p column in the site-wide
search table. Though this is a simple solution, it will force a write
to the site-wide search table, decreasing overall performance. The key
to maintaining top search performance is to ensure that any
book-keeping we do is separate from the data associated with our
site-wide-index. Since the dirty_p column is in the same
tablespace and datafile as the rest of the site-wide search table,
updating this column will force an unnecessary disk access to the same
drive that holds your site-wide search table.
create table search_dirty_rows (
on_which_table varchar(50) not null,
on_what_id integer not null,
constraint sdr_which_table_what_id_pk
primary key(on_which_table, on_what_id),
creation_date date default sysdate
);
Note that you may consider removing the primary key
constraint to improve write performance to this table. We recommend
the constraint as it leads to cleaner code to update the corresponding
data in the search index as we do not neet to worry about updating the
same data twice.
create or replace trigger vs_agency_tr
before insert or update on vs_agency
for each row
DECLARE
v_exists_p integer;
BEGIN
select decode(count(*),0,0,1) into v_exists_p
from search_dirty_rows
where on_which_table='vs_agency'
and on_what_id=:new.agency_id;
if v_exists_p = 0 then
BEGIN
-- We have to catch the case where another row is created
-- between the time we've counted and are going to insert
-- Note that the exception means the row already exists
-- so we don't have to do anything if we catch the exception
insert into search_dirty_rows
(on_which_table, on_what_id)
values
('vs_agency', :new.agency_id);
EXCEPTION WHEN OTHERS THEN NULL;
END;
end if;
END;
search_dirty_rows leads to code that is
easier to understand and simpler to maintain. It is also easy to add
data from new tables to the site-wide search table, since the procedures
to keep the data up-to-date are independent.
Step Two: Synchronize the interMedia indexes
Now that we have created a site-wide search table, we need to tell
interMedia what text to index. interMedia's datastore class is
responsible for translating any data into text that can be indexed
(the data itself can be stored as a blob, varchar2, or clob with clob
being the default). The datastore class is made up of the following
objects:
The direct and user datastore objects are the most useful for
site-wide full-text searching. Using the direct datastore object may
require additional storage space as we store a duplicate copy of the
text to index in a column of our site-wide search table. The primary
advantage of the direct datastore is that we can simply query the
table to see what text is being indexed. With the user datastore
object, we have to execute the specified procedure to see what text is
being indexed.
keywords column of the table search.
To index the text in this column, we execute:
SQL> create index search_keywords_im_idx
on search(keywords)
indextype is ctxsys.context;
This command creates an interMedia index on
search(keywords). Note that the "im" in the name of the
index simply lets us quickly see that this is an interMedia index. The
amount of time it takes to create the index depends directly on the
hardware you are using and the amount of data that you are
indexing. On GuideStar, it takes about 2 1/2 hours to index
approximately 1gb of data on a Sun E450 with 4 processors and 4gb of
RAM. Note that in order to create interMedia indexes, the oracle
user must have been granted the ctxapp role.
SQL> select count(*)
from search
where contains(keywords, 'guidestar', 1) > 0
order by score(1);
2 3 4
COUNT(*)
----------
5
Elapsed: 00:00:00.02
SQL> select count(*)
from search
where contains(keywords, 'philanthropy', 1) > 0
order by score(1);
2 3 4
COUNT(*)
----------
32517
Elapsed: 00:00:01.99
Once we have built the interMedia index, we need to keep it
synchronized with the data in the site-wide search table. Unlike
Oracle B-tree indexes, interMedia indexes are not updated during the
execution of SQL transactions. There are two ways to update the
interMedia indexes:
ctxsrv)
Method One: Oracle Context Server
Oracle Context Server immediately updates an interMedia index after
any change to the underlying data. This process frees the user from
having to think about updating the interMedia indexes as the index
updates are transparent. To start context server, execute the
following from the Unix shell:
> ctxsrv -user ctxsys/ctxsyspwd -personality M -log $ORACLE_HOME/ctx/ctxsys.log &
where ctxsyspwd is the password for the Oracle user ctxsys.
Method Two: Manually update the index
The second method is to manually update the index on a regular
basis. The simplest way to update an interMedia index is to open a
SQL*Plus session as the user who created the index. Then:
SQL> alter index search_keywords_im_idx rebuild online
parameters('sync memory 45M');
Index altered.
The parameters of this call:
"Manually" updating the index is the preferred method. The best way to
do this is to run a script every hour (or however often you want to
update your index) that executes the appropriate ALTER INDEX
statements. The script should run from a robust environment that is
relatively safe from interruptions such as a Unix rebuild online: Rebuild the index incrementally so that it is still accessible.
If you forget to specify "online," users searching the Web site would not
be able to access the index.
sync: Just update the index (rather than rebuild it entirely)
memory 45M: Specify the amount of memory, in megabytes, that interMedia can
use while rebuilding the index. Using more memory decreases the amount of
disk I/O and thus increases the speed with which the index can be rebuilt.
The default is 12 megabytes and the acceptable range of inputs for memory
is 1 KB to about 50 MB. To increase the maximum allowable index memory, execute (as the ctxsys user):
SQL> exec ctx_adm.set_parameter('max_index_memory', '1G');
cron job or as an
Oracle DBMS job.
cron job (see http://software.arsdigita.com/bin/ad-context-server.pl). This
script uses Perl DBI and DBD-Oracle to connect as the
ctxsys user, reads a list of all indexes that need to be
updated (SQL> select PND_INDEX_NAME, count(*) from
ctx_user_pending group by PND_INDEX_NAME;), and executes the
alter index statements. Using this script, we have run both GuideStar
and Volunteer Solutions with no hitches for over half a year now.
DBMS_JOB to issue the update statement. The webmail
module in the ACS uses this method:
-- Resync the interMedia index every hour.
DECLARE
v_job number;
BEGIN
dbms_job.submit(v_job,
'ctx_ddl.sync_index(''search_keywords_im_idx'');',
interval => 'sysdate + 1/24');
END;
SQL> alter index search_keywords_im_idx rebuild online
parameters('sync memory 45M');
Two hours later, all keyword searches on the Web site were
failing. Users were getting back a page with zero results, regardless
of their query. We exectued another alter index statement
and got back an error saying that the index was marked as
FAILED/LOADING. Had we been smart, we would have logged what was
going on by executing, as the ctxsys user:
SQL> exec ctx_output.start_log('better_days');
to log the progress of the alter index statement. This
command would have logged all errors and notices to
$ORACLE_HOME/ctx/log/better_days.
ctxsys user and
running select err_timestamp, err_index_name, err_text from ctx_index_errors order by err_timestamp; to get a list of all the errors on interMedia indexes, sorted by date.
resume parameter to the alter index
statement is designed to finish updating an index:
SQL> alter index search_keywords_im_idx rebuild online
parameters('resume memory 45M');
The resume statement has never worked for us but maybe you will have
better luck. Probably, you will just see the same error message you had
before. Somehow the index has become corrupted:
SQL> drop index search_keywords_im_idx force;
SQL> create index search_keywords_im_idx on search(keywords)
indextype is ctxsys.context;
And 2 1/2 hours later, the index was re-created and back
on-line. Meanwhile, our Web site was not serving search results for
over three hours.
Note that an uncommitted session in SQL*Plus can keep the alter index
statement from completing. Also, the probability of deadlock is
directly related to the amount of time it takes for your index to be
updated. If you update your index once an hour, and the process only
takes a couple minutes to finish, it is unlikely that you will ever
see a deadlock. On GuideStar, some index updates would take a few
hours making it easy to end up with a deadlock.
alter index statement is running at any
given time. More than one statement could result in a deadlock inside
interMedia that can only be solved by dropping and recreating the
index.
Known bugs
Oracle 8.1.5 has a bug in the parser for the about
clause. The bug seems to incorrectly map some multi word queries to
their corresponding tokens in the interMedia tables. For example,
about('massachusetts bay'), which is tokenized as
'Massachusetts Bay', is parsed as 'Massachusetts
bay', not matching any row in the interMedia tokens table and
thereby throwing an error. If you find yourself in this situation,
you can remove the about clauses in your queries.
00600, 00000, "internal error code, arguments: [%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]"
// *Cause: This is the generic internal error number for Oracle program
// exceptions. This indicates that a process has encountered an
// exceptional condition.
// *Action: Report as a bug - the first argument is the internal error number
All of these problems went away when we upgraded to Oracle 8.1.6. If
you are going to use interMedia, we do not recommend using any release
prior to Oracle 8.1.6. Oracle has also released patches for 8.1.6 to
address a few initial problems, such as theme-based queries. These
patches should be applied before using interMedia.
General Tuning Considerations
On the web, we are often interested in getting back the first few rows
that match a query as fast as possible (as opposed to generating a
list of all the matches). As with other Oracle SQL queries, interMedia
makes use of the FIRST_ROWS hint to optimize the query to
return the first rows in the shortest amount of time. See http://technet.Oracle.com/products/intermedia/htdocs/text_techover.html#query_tuning
for more information about query optimization.
crontab - no sense risking a deadlock!):
SQL> alter index search_keywords_im_idx rebuild online
parameters('optimize fast');
Optimization is an extremely slow process and it is not clear that the
performance gain of an optimized index will be noticeable. If you are
going to optimize, you should limit the amount of time that the
optimization will take to, for example, 180 minutes:
SQL> alter index search_keywords_im_idx rebuild online
parameters('optimize full maxtime 180');
The interMedia documentation (http://oradoc.photo.net/ora816/inter.816/a77061/ind10.htm)
has more information about optimizing indexes.
Text parsing - the interMedia query language
To use interMedia full-text search, you must learn interMedia's query
language. Here we present the most commonly used functions in terms of
full-text search on the web.
contains and
score clauses to your SQL query. An example is the
simplest way to learn this syntax:
SQL> select org_name, score(1) as the_score
from search
where contains(keywords, 'guidestar', 1) > 0
order by score(1) desc;
ORG_NAME THE_SCORE
-------------------------------------------------- ----------
Philanthropic Research, Inc. 100
Thankful Hearts Food Pantry, Inc. 17
Decatur County Education Foundation, Inc. 17
Wadessa Project Hope 17
Safe Haven Project, Inc. 17
In this example, we are looking for any row that matches the keyword
"guidestar" using the interMedia index on the keywords
column of the table search. The third argument to the
contains function, 1, is simply a
placeholder that matches the argument to the score
function. You can specify multiple keywords in the query string by
separating each keyword with a comma. You can also specify the
relative weights of each of the keywords to affect how the overall
score is computed:
SQL> select org_name, score(1) as the_score
from search
where contains(keywords, 'guidestar*3,apotheosis', 1) > 0
order by score(1) desc;
ORG_NAME THE_SCORE
-------------------------------------------------- ----------
Philanthropic Research, Inc. 58
Thankful Hearts Food Pantry, Inc. 54
Decatur County Education Foundation, Inc. 54
Wadessa Project Hope 54
Safe Haven Project, Inc. 54
A Temple of the Apotheosis, Inc. 25
Indianapolis Art Center, Inc. 5
AND, OR, and NOT (abbreviated
to &, |, and ~ respectively).
Note that the NOT operator is equivalent
to AND NOT.
$ - This stemming operator matches all words
similar to the one specified. For example, the clause
contains(keywords,'$child',1) > 0 matches all rows
containing the word child or any related words such as children:
SQL> select count(*)
from search
where contains(keywords,'child',1) > 0;
COUNT(*)
----------
198
SQL> select count(*)
from search
where contains(keywords,'$child',1) > 0;
COUNT(*)
----------
590
SQL> select count(*)
from search
where contains(keywords,'$children',1) > 0;
COUNT(*)
----------
590
near - This operator looks for the proximity of
the specified words to each other. The closer the words appear to
each other, the higher the score. For example, if we wanted to find all rows matching youth soccer:
SQL> select count(*)
from search
where contains(keywords,'youth,soccer',1) > 0;
COUNT(*)
----------
86
SQL> select count(*)
from search
where contains(keywords,'youth NEAR soccer',1) > 0;
COUNT(*)
----------
3
about - This operator takes advantage of the theme of each document to expand the number of results returned to the user. For example, using the about operator on the same query as above returns an additional eight rows:
SQL> select count(*)
from search
where contains(keywords,'about(youth,soccer)',1) > 0;
COUNT(*)
----------
94
NEAR and ABOUT operators (see the
im_convert function in http://software.arsdigita.com/www/doc/sql/site-wide-search.sql). This
function is fairly rough, but in many cases will suffice.
Summary
The overall process for implementing site-wide full-text search is as follows:
Future work
Currently, the most painful flaw in our use of interMedia is the
amount of work it takes to correctly translate a user's input to
interMedia jargon. We need to create a suite of text parsers that can
be used for different situations. The site-wide search module in
ACS continues to improve on the functions used to translate
user-entered text into interMedia query syntax, but there is still a
lot of room for improvement.
asj-editors@arsdigita.com
Reader's Comments
In these happy days of Oracle 8.1.6.2 (http://www.photo.net/scratch/ctx_patch/), it is preferable to use a dbms_job to sync the Intermedia index.
-- David Eison, January 28, 2001
Great document about interMedia presenting an introduction to interMedia and outlining many of its features.
-- Michael Bryzek, January 31, 2001Related Links