Monday, February 27, 2012

Keyword Search via Oracle Text

Oracle text is a feature available in the Oracle Database and is used to provide keyword search indexing to large blocks of text and even binary formatted files like Word and PDF files.

As part of a project I am working on, I need to create a keyword search index that spans multiple columns. This will allow my users to search for keywords in the title, abstract and content of a note entered into the system. The note could be in the form of an uploaded file, or it could be manually entered through the interface.

This post will take you step-by-step through the process of building the back-end Oracle Text index and then using the Oracle SQL Contains clause to leverage that index for full text search across multiple columns.

Before we can create the search indexes, we need a table:

create table note_content ( TITLE VARCHAR2(256 BYTE) --Note Title , ABSTRACT VARCHAR2(4000 BYTE) --Short Description , TEXT_CONTENT CLOB --Content for text-based notes , FILE_CONTENT BLOB --Storage for binary file , FILE_NAME VARCHAR2(400 BYTE) --File name for binary file , FILE_MIME_TYPE VARCHAR2(256 BYTE) --Mime type for binary file , INDEX_FLAG VARCHAR2(1 BYTE) --Column used to control index --sync (more later) , NOTE_TYPE VARCHAR2(1 BYTE) --Flag indicating text/binary );


insert into note_content(title, abstract, text_content)
values ('foo', 'bar', 'this is a test');


insert into note_content(title, abstract, text_content)
values ('abc', 'def', 'foo is the source of bar');

insert into note_content(title, abstract, text_content)
values ('123', '456', '7890');

commit;


Now that we have our table and data, we need to setup the multi-column Oracle Text index. This is where the magic happens as we'll allow the users to search for keywords in the notes title, abstract or content (text or binary).

First we need to setup a multi-column data store that includes each of the columns we want to be searchable:

begin ctx_ddl.create_preference('note_multi_store','MULTI_COLUMN_DATASTORE'); ctx_ddl.set_attribute('note_multi_store', 'columns', 'title,abstract,text_content,file_content' ); end; /

After creating our data store, we need to create a CONTEXT type index. Since we'll be specifying our custom data store, we can create the index on any column we want. In this example we'll choose the INDEX_FLAG column:

CREATE INDEX note_content_search_idx ON note_content(index_flag) INDEXTYPE IS ctxsys.context PARAMETERS('DATASTORE note_multi_store sync (on commit)');


Note: Context indexes require special handling to ensure the content is indexed and available for search. While the create statement above will sync on commit (10g+), the INDEX_FLAG column must be updated as part of the transaction to flag the record for synchronization. Without updating INDEX_FLAG the sync won't happen and new content will not be searchable. We can automate this update by using a trigger:

create or replace TRIGGER note_content_audit BEFORE INSERT or update ON note_content FOR EACH ROW BEGIN :new.index_flag := 'Y'; END;
/

Now that we have the trigger, let's update our records to ensure index_flag is set and the context index is updated
update note_content
set title = title;

select title, index_flag
from note_content;


TITLE INDEX_FLAG
----------------------
foo Y abc Y 123 Y


It is also possible to manually synchronize the index or create a dbms_scheduler_job to do so on an interval. This is advisable if the sync on commit is causing performance problems, or you do not need the note content indexed immediately. It is also advisable to optimize the index periodically to  eliminate data fragmentation. The relevant commands are as follows:


--Synchronize Oracle Text indexes for full-text search using 2MB of memory CTX_DDL.SYNC_INDEX('note_content_search_idx', '2M'); --Optimize the oracle text indexes for full-text search --Full indexing (compact + remove) maximum 60 minutes each ctx_ddl.optimize_index('note_content_search_idx', 'FULL', 60);

To leverage the oracle text index we need to use the contains clause in our sql statements. We can use any column as the first argument. In this example I am going to use the same INDEX_FLAG column we are using to control synchronization. Because the Index uses our custom data store, the indicated column doesn't matter:


Select title, abstract, text_content from note_content where contains (index_flag, 'foo and bar', 1) > 0;

TITLE ABSTRACT TEXT_CONTENT
----------------------------------------- foo bar this is a test abc def foo is the source of bar

The above query will return any record where foo and bar occur in the title, abstract, text_content of file_content columns. Note that 'foo' may occur in one column while 'bar' can be in a separate column. This query also demonstrates the use of boolean AND/OR operators to target keyword searches.

This article should provide a solid foundation for using Oracle Text to implement full-text search for a table. Oracle text is a robust product that allows for a high degree of customization. It also has tools to rank results and/or highlight and display resulting documents. For a more general introduction, please see Oracle's Introduction to Oracle Text article.