What do developers usually think about Oracle's hierarchical queries?
You're wrong with all these assumed limitions: You can use joins and order by with connect by queries. And you can make them fast.
Let's look at an example. I chose the categories and category_hierarchy tables from the ArsDigita Community System (ACS). See the doc for it at "User Profiling and Content Categorization Module".
Here is the data model:
create table categories ( category_id integer not null primary key, category varchar(50) not null, category_description varchar(4000), profiling_weight number default 1 check(profiling_weight >= 0), enabled_p char(1) default 't' check(enabled_p in ('t','f')), mailing_list_info varchar(4000) ); create table category_hierarchy ( parent_category_id integer references categories, child_category_id integer references categories unique (parent_category_id, child_category_id) );
The categories table contains the category descriptions. As each object can have more than one parent, we have to store data about the hierarchy in a different table: category_hierachy. Note that this is not a tree hierarchy, but it's a net hierarchy.
I got my test data from the real-life ShareNet System: The categories table has 522 rows and the category_hierarchy has 547 rows. So let's get started.
Even the Oracle Documentation claims: If you specify a hierarchical query, the same statement cannot also perform a join.
We want to show the category hierarchy with proper indentation, so we have to "connect" the data from the tables categories and category_hierachy in some way. Let's try:
select lpad(' ',level*3) || category cat, parent_category_id p_id, child_category_id ch_id, rownum ord, level lev from categories b, category_hierarchy a where a.child_category_id= b.category_id start with parent_category_id is null connect by prior child_category_id = parent_category_id order by ord ; ERROR at line 2: ORA-01437: cannot have join with CONNECT BY
But we can do it within a subquery:
select lpad(' ',lev*3) || category cat, a.p_id, a.ch_id from categories b, (select parent_category_id p_id, child_category_id ch_id, rownum ord, level lev from category_hierarchy start with parent_category_id is null connect by prior child_category_id = parent_category_id) a where a.ch_id= b.category_id order by ord ;
The lpad function makes the query more expensive, but I want only the execution plan and time for the query itself, so I'll remove it for the rest of this article.
select category cat from categories b, (select parent_category_id p_id, child_category_id ch_id, rownum ord, level lev from category_hierarchy start with parent_category_id is null connect by prior child_category_id = parent_category_id) a where a.ch_id= b.category_id order by ord ; 3037 rows selected. Elapsed: 00:00:03.05 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=5 Card=9 Bytes=387) 1 0 SORT (ORDER BY) (Cost=5 Card=9 Bytes=387) 2 1 HASH JOIN (Cost=3 Card=9 Bytes=387) 3 2 VIEW (Cost=1 Card=9 Bytes=234) 4 3 COUNT 5 4 CONNECT BY 6 5 TABLE ACCESS (FULL) OF 'CATEGORY_HIERARCHY' (Cost=1 Card=19 Bytes=57) 7 5 TABLE ACCESS (BY USER ROWID) OF 'CATEGORY_HIERARCHY' 8 5 INDEX (FAST FULL SCAN) OF 'CH_CP' (UNIQUE) (Cost =1 Card=9 Bytes=54) 9 2 TABLE ACCESS (FULL) OF 'CATEGORIES' (Cost=1 Card=522 Bytes=8874)
The Oracle Documentation states: If you specify the order_by_clause, it takes precedence over any ordering specified by the hierarchical query.
First, let's look at a simple case:
create table hierarchy_data ( id integer primary key, name varchar2(20), parent_id integer ); insert into hierarchy_data values(1, 'Support team', 5); insert into hierarchy_data values(2, 'Munich branch', 3); insert into hierarchy_data values(3, 'FooBar Inc.', null); insert into hierarchy_data values(4, 'Sales team', 2); insert into hierarchy_data values(5, 'Boston branch', 3); insert into hierarchy_data values(6, 'Developers team', 2); insert into hierarchy_data values(7, 'Research team', 5); commit;
We want to write a "select" statement that shows the hierarchy, and we want to sort each node's childs alphabetically:
FooBar Inc. Boston branch Research team Support team Munich branch Developers team Sales team
Now let's try:
select lpad(' ',(level-1)*3)||name name from hierarchy_data start with parent_id is null connect by prior id = parent_id order by name ; NAME -------------------------------------------------------------------------------- Developers team Research team Sales team Support team Boston branch Munich branch FooBar Inc. 7 rows selected.
So - as expected - the "order by" clause breaks the order of rows created by the "connect by" clause. So, let's do it another way
create index hierarchy_data_ind on hierarchy_data (parent_id, name); select /*+INDEX(hierarchy_data hierarchy_data_ind)*/ lpad(' ',(level-1)*3)||name name from hierarchy_data start with parent_id is null connect by prior id = parent_id ; NAME ---------------------- FooBar Inc. Boston branch Research team Support team Munich branch Developers team Sales team 7 rows selected.
Now the sql result set is properly ordered. The INDEX hint explicitly chooses an index scan for the table and also asks the Oracle SQL engine to retrieve rows from the table according to their order in the index.
As you can see in the above case, data about the ordering and the hierarchy information are both stored in the same table. But, what if this data is spread across more than one table ? For instance, like in the categories and the category_hierarchy tables, our first example. Here we have two possible solutions:
But before we are trying those two possible solutions, let's check how fast Oracle finds roots of trees.
set autotrace on exp select 1 from category_hierarchy where parent_category_id is null; ... ... 19 rows selected. Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=19 Bytes=114) 1 0 TABLE ACCESS (FULL) OF 'CATEGORY_HIERARCHY' (Cost=1 Card=19 Bytes=114)
A Full Table Scan usually means a perfomance decrease. Why doesn't Oracle use an index to retrieve the roots in this case? Well, we have null values stored in the parent_category_id column, and null values usually aren't included in the index. So we replace the null values of each root node with "-1".
update category_hierarchy set parent_category_id = -1 where parent_category_id is null; commit; select 1 from category_hierarchy where parent_category_id =-1; 19 rows selected. Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=9 Bytes=27) 1 0 INDEX (FAST FULL SCAN) OF 'CATEGORY_HIERARCHY_PK' (UNIQUE) (Cost=1 Card=9 Bytes=27)
Now Oracle can use the index. For small tables this doesn't make much of a difference, but with large amounts of data it becomes very important.
Let's define a new table that contains the parent_category_id, the child_category_id and the category. By the way, we deliberately insert the data in wrong order in the following create table statement (i.e descending by name) to be sure that the select statements work properly.
create table category_hierarchy_den (parent_category_id, child_category_id, category, constraint category_hierarchy_den_pk primary key (parent_category_id, category) ) as select parent_category_id, child_category_id ,category from category_hierarchy a, categories b where a.child_category_id = b.category_id order by b.category desc ;
We define the primary key ordered by parent_category_id and category, so we don't need to define an additional index for sorting. We also have to create triggers on the categories table to synchronize values with the category column of the category_hierachy table.
Now let's execute the query and look at the perfomance:
select /*+INDEX(CATEGORY_HIERARCHY_DEN CATEGORY_HIERARCHY_DEN_PK)*/ category cat from category_hierarchy_den start with parent_category_id =-1 connect by prior child_category_id = parent_category_id ; ....... ....... 3037 rows selected. Elapsed: 00:00:00.68 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE 1 0 CONNECT BY 2 1 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_DEN_PK' (UNIQUE) 3 1 TABLE ACCESS (BY USER ROWID) OF 'CATEGORY_HIERARCHY_DEN' 4 1 TABLE ACCESS (BY INDEX ROWID) OF 'CATEGORY_HIERARCHY_DEN' 5 4 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_DEN_PK' (UNIQUE)
The query result is ordered as wanted, because the query uses the CATEGORY_HIERARCHY_DEN_PK index.
To further improve the perfomance, we can use the fact that the index can also contain the child_category_id column. Oracle will read the values for this column directly from the index and use it within joins. Let's look at the difference:
alter table CATEGORY_HIERARCHY_DEN drop constraint CATEGORY_HIERARCHY_DEN_PK; alter table CATEGORY_HIERARCHY_DEN add constraint CATEGORY_HIERARCHY_DEN_PK primary key (parent_category_id, category, child_category_id); select /*+CATEGORY_HIERARCHY_DEN CATEGORY_HIERARCHY_DEN_PK*/ category cat from category_hierarchy_den start with parent_category_id =-1 connect by prior child_category_id = parent_category_id ; ....... ....... 3037 rows selected. Elapsed: 00:00:00.53 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE 1 0 CONNECT BY 2 1 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_DEN_PK' (UNIQUE) 3 1 TABLE ACCESS (BY USER ROWID) OF 'CATEGORY_HIERARCHY_DEN' 4 1 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_DEN_PK' (UNIQUE)
So we save one additional request, because Oracle now retrieves the value for the child_category_id directly from the index and does not have to look up the table anymore.
This feature gives us the ability to create indexes on functions and expressions that involve one or more columns in the table being indexed. A function-based index precomputes the value of the function or expression and stores it in the index.
First of all, we create a function which retrieves the category for a given category_id
create or replace function get_category(id integer) return varchar2 deterministic is val varchar2(50); begin select category into val from categories where category_id = id; return val; exception when others then return ' '; end; /The DETERMINISTIC keyword ensures that the function reliably returns the same result value whenever it is called with the same arguments. This behaviour is required by functional indexes. Let's create the functional index:
create index category_hierarchy_func_ind on category_hierarchy (parent_category_id, substr(get_category(child_category_id),1,50), child_category_id);
We have to use the substr function in this create index statement, because functions that return string values, return by default Varchar2(4000). This is too big for an index because an index key cannot exceed 1/3 of the database block size. So, after creating the functional index, let's collect statistics on both the index and its base table using the ANALYZE statement. Oracle cannot use the function-based index until these statistics have been generated.
Analyze table category_hierarchy compute statistics;
Now let's write the "ordered" hierarchical query:
select category cat from categories b, (select /*+INDEX(category_hierarchy, category_hierarchy_func_ind)*/ parent_category_id p_id, child_category_id ch_id, rownum ord, level lev from category_hierarchy start with parent_category_id =-1 connect by prior child_category_id = parent_category_id ) a where a.ch_id= b.category_id order by ord ; 3037 rows selected. Elapsed: 00:00:00.58 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=9 Bytes=387) 1 0 SORT (ORDER BY) (Cost=6 Card=9 Bytes=387) 2 1 HASH JOIN (Cost=4 Card=9 Bytes=387) 3 2 VIEW (Cost=2 Card=9 Bytes=234) 4 3 COUNT 5 4 CONNECT BY 6 5 INDEX (FAST FULL SCAN) OF 'CATEGORY_HIERARCHY_FUNC_IND' (NON-UNIQUE) (Cost=1 Card=9 Bytes=27) 7 5 TABLE ACCESS (BY USER ROWID) OF 'CATEGORY_HIERARCHY' 8 5 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_FUNC_IND' (NON-UNIQUE) (Cost=2 Card=9 Bytes=54) 9 2 TABLE ACCESS (FULL) OF 'CATEGORIES' (Cost=1 Card=522 Bytes=8874)
Here's another nice thing about functional indexes: We can get the category column value from the index, so we don't need to perform a subquery. But we need to set the parameters QUERY_REWRITE_ENABLED=TRUE and QUERY_REWRITE_INTEGRITY=TRUSTED in the database init file or set them on a session level. The first parameter allows Oracle to rewrite a query using a functional index, so the
substr(get_category(child_category_id),1,50)in the result set is not computed, but directly taken from the functional index we've just defined. The other parameter tells Oracle that this index's function is indeed deterministic.
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE; ALTER SESSION SET QUERY_REWRITE_INTEGRITY=TRUSTED;
Now, let's run the query
select /*+INDEX(category_hierarchy, category_hierarchy_func_ind)*/ substr(get_category(child_category_id),1,50) from category_hierarchy start with parent_category_id =-1 connect by prior child_category_id = parent_category_id ; ..... ..... 3037 rows selected. Elapsed: 00:00:00.50 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=HINT: RULE (Cost=2 Card=9 Bytes=54) 1 0 CONNECT BY 2 1 INDEX (FAST FULL SCAN) OF 'CATEGORY_HIERARCHY_FUNC_IND' (NON-UNIQUE) (Cost=1 Card=9 Bytes=27) 3 1 TABLE ACCESS (BY USER ROWID) OF 'CATEGORY_HIERARCHY' 4 1 INDEX (RANGE SCAN) OF 'CATEGORY_HIERARCHY_FUNC_IND' (NON-UNIQUE) (Cost=2 Card=9 Bytes=54)
Everything looks fine, doesn't it ? But, let's take a closer look at the DETERMINISTIC function requirements. What happens if the category column value is changed:
update categories set category = 'bla bla' where category_id = &id; select count(*) from category_herarchy where child_category_id = &id;
I tried this on Oracle 8.1.6 and you'll probably get this nice Oracle error:
ORA-08102: index key not found, obj# 100010, dba 117444338 (2)
What says the Oracle Error Message Reference about this error?
ORA-08102 index key not found, obj# string, dba string (string) Cause: This is an internal error; possible inconsistency in index. Action: Send trace file to Oracle Customer Support, along with information on reproducing the error.
So, it seems like our function breaks the deterministic rule after updating the category column. But, we can do the following
It works because you can always change the function result for values which are not used in the index. So when you delete the appropriate rows from category_hierarchy, the appropriate values will disappear from the functional index. So, let's write a procedure which does that:
create or replace procedure update_category_name (id integer, name varchar2) is Type Tchilds is table of category_hierarchy.child_category_id%type; Type Tparents is table of category_hierarchy.parent_category_id%type; hchilds Tchilds; hparents Tparents; begin -- put rows related with category which should be changed into PL/SQL collections select parent_category_id, child_category_id BULK COLLECT into category_hierarchy_row.hparents, category_hierarchy_row.hchilds from category_hierarchy where child_category_id = id ; -- delete rows related with category which should be changed delete category_hierarchy where child_category_id = id; -- change category name update categories set category = name where category_id = id; -- restore rows related with category FORALL i in 1..category_hierarchy_row.hparents.last insert into category_hierarchy values(category_hierarchy_row.hparents(i), category_hierarchy_row.hchilds(i)); end; /Note: To understand the "Bulk Collect" clause and the "Forall" statement see Oracle PL/SQL User's Guide and Reference, "Collections and Records" chapter.
This article shows two ways to overcome some restrictions of hierarchical queries, some examples of new Oracle features, and the effects of proper tuning. So let's look at the result
Perfomance estimation:
Original Query | Denormalization | Functional Index | |
---|---|---|---|
Execution time (sec) | 03.05 | 00.53 | 00.50 |
In terms of execution time there's not much difference between the Denormalization case and the Functional Index case.
In the Denormalization case you don't have to care about the deterministic rule of the update procedure, but you have to create additional triggers, you have to add a column for every sorting criteria and you have to create appropriate indexes.
In the Functional Index case the datamodel remains unchanged, but you have to care that the update process does not break the deterministic rule. You can create additional indexes for every sorting criteria without denormalization, because you can retrieve the necessary values directly from the index.