Friday, April 20, 2012

Oracle Text and Friendly Search Expressions


Oracle Text is a feature of the Oracle database that allows developers to index large portions of data and provide users a search-engine like view into that data. It comes with a powerful set of functionality that allows the experienced or knowledgeable user to craft very specific queries. But what about everybody else? Most users want the search functionality to be basic and/or follow what they have become accustom to with the various online-search engines.

Today I will show you how to provide a sub-set of Oracle text features (specifically using the context index and contains clause) that mimic the major features of most online search tools. This will allow you to use Oracle Text in your application without forcing users into a new set of search expressions.

Before we get started with the search expressions, we need to build and populate a test table. This test table uses a clob, but you could also index a blob column or search against a multi-column index like the one I detail here.

create table simple_search_ex (  ident number,
content clob, );

insert into simple_search_ex (ident, content) values (1, 'This is the first row, it is red');
insert into simple_search_ex (ident, content) values (2, 'This is the second row, it is blue');
insert into simple_search_ex (ident, content) values (3, 'This is the third row, it is red');
insert into simple_search_ex (ident, content) values (4, 'This is the fourth row, it is green');

Now that we have the table built, we need to add the Oracle text index. I'm going to use the sync on commit parameter so that updating the column will update the index


create index simple_search_ex_idx on simple_search_ex(content)
indextype is ctxsys.context PARAMETERS ('SYNC ( ON COMMIT )');

Once we have the test data setup, we can get down to business. Oracle Text has a ton of complex search options and operators that you can exploit (see the documentation for the full list). To keep it simple for my users, I'm going to stick with the basics. My search will allow the user of  "AND", "OR", "NOT",  "()" and expressions in double quotes  or surrounded by braces "{}" will be evaluated as a single word. Finally spaces will translate to "AND" in my algorithm.

Limiting myself to the simple options means we can transform the user's input into the appropriate expression for the CONTAINS clause with a few regexp_replace calls. We simply want to transform the user's English search expression into one that Oracle Text will allow for. I've commented the code below, the only really tricky part is ensuring that we do not replace characters inside of quotes. That requires a few extra replacements to work properly.

Before we can create the function, we must create a pl/sql table to hold the tokenized input:


create or replace type vc_lst as table of varchar2(4000);


Now we can create the function:


create or replace function translate_search_expression(p_input varchar2) return varchar2 is v_procedure_c constant varchar2(30) := 'translate_search_expression'; v_location_i integer; -- This variable tracks our location within this procedure. v_input varchar2(31000); v_word_lst vc_lst := vc_lst(); v_return varchar2(31000); l_n number; begin -- Set Tracing Information. v_location_i := 500; dbms_application_info.set_client_info('Session="'||v('APP_SESSION')||'"'); dbms_output.put_line(v_procedure_c); v_location_i := 1000; v_return := p_input; v_location_i := 1500; --Replace "some text" with {some text} v_return := regexp_replace(v_return, '"(.*?)"', '{\1}',1,0,'i'); v_return := regexp_replace(v_return, '"', '', 1,0,'i'); v_location_i := 1600; --Replace ( and ) with ' ( ' and ' ) ' if not wrapped in spaces v_return := regexp_replace(v_return, '([^\s])\(([^\s])', '\1 ( \2',1,0); v_return := regexp_replace(v_return, '([^\s])\)([^\s])', '\1 ) \2',1,0); v_location_i := 2000; --Split the search expression on spaces loop l_n := instr( v_return, ' ' ); exit when (nvl(l_n,0) = 0); v_word_lst.extend; v_word_lst( v_word_lst.count ) := ltrim(rtrim(substr(v_return,1,l_n-1))); v_return := substr( v_return, l_n+1 ); end loop; --Replace non-supported oracle text whitespace or operators if (v_word_lst.count > 0) then for i in v_word_lst.first..v_word_lst.last loop --Underscore is seen as a wildcard usually, wrap it so it is literal if (instr(v_word_lst(i), '_') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --the dash (-) is normally minus (lowers score if second word exists) elsif (instr(v_word_lst(i), '-') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --ensure the comma is literal, not accumulate elsif (instr(v_word_lst(i), ',') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --ensure equal sign is literal not equivilance elsif (instr(v_word_lst(i), '=') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --ensure semi-colon is literal, not near elsif (instr(v_word_lst(i), ';') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --ensure explination point is literal not soundex elsif (instr(v_word_lst(i), '!') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; --ensure dolar sign is literal not stem elsif (instr(v_word_lst(i), '$') > 0) then v_word_lst(i) := '{'||v_word_lst(i)||'}'; end if; end loop; end if; v_location_i := 3000; --Put string back together with spaces v_return := ''; if (v_word_lst is not null and v_word_lst.count > 0) then for i in v_word_lst.first..v_word_lst.last loop v_return := v_return || ' ' || v_word_lst(i); end loop; --trim off excess delimiters at beginning and end v_return := ltrim(rtrim(v_return, ' '), ' '); end if; v_location_i := 3500; --Replace keyword " AND " with & unless it exists inside of {} v_return := regexp_replace(v_return, '(\{.*?\})|\s+AND\s+|&', '\1&',1,0, 'i'); --Now replace "(&" with " (" and "&)" with ") " and remove any leading or trailing spaces from the string v_return := regexp_replace(v_return, '\s*\((&|\s)*', ' (',1,0); v_return := regexp_replace(v_return, '(&|\s)*\)\s*', ') ',1,0); v_return := ltrim(rtrim(v_return, ' ')); --dbms_output.put_line('after & replacement: '||v_return); --Replace keyword " OR " with | unless it is inside {} v_return := regexp_replace(v_return, '(\{.*?\})|\s+OR\s+|\|', '\1|',1,0, 'i'); --dbms_output.put_line('after | replacement: '||v_return); --Replace keyword " NOT " with ~ unless it is inside {} v_return := regexp_replace(v_return, '(\{.*?\})|\s+NOT\s+|~', '\1~',1,0, 'i'); --dbms_output.put_line('after ~ replacement: '||v_return); --Now replace any remaining spaces with &, unless they are inside {} v_return := regexp_replace(v_return, '(\{.*?\})|\s+', '\1&',1,0, 'i'); --dbms_output.put_line('after " " replacement: '||v_return); --Here we must cleanup the condition where we'll get }~|& due to above regex v_return := regexp_replace(v_return, '}[&~\|]+&', '}&',1,0,'i'); --dbms_output.put_line('after && replacement: '||v_return); --Here we must cleanup the condition where we'll get }~| due to above regex v_return := regexp_replace(v_return, '}[~\|]+\|', '}|',1,0,'i'); --In some cases we'll see &|& or &~&, so fix those v_return := regexp_replace(v_return, '&\|&', '|',1,0,'i'); v_return := regexp_replace(v_return, '&~&', '~',1,0,'i'); --In some cases we'll see &&( or )&&, so fix those v_return := regexp_replace(v_return, '&&\(', '&(',1,0,'i'); v_return := regexp_replace(v_return, '\)&&', ')&',1,0,'i'); --In some cases we'll see |&( or )&|, so fix those v_return := regexp_replace(v_return, '\|&\|\(', '|(',1,0,'i'); v_return := regexp_replace(v_return, '\)&\|', ')|',1,0,'i'); --In some cases we'll see ~&( or )&~, so fix those v_return := regexp_replace(v_return, '~&\(', '~(',1,0,'i'); v_return := regexp_replace(v_return, '\)&~', ')~',1,0,'i'); --Finally ensure we don't have any operatiors at the beginning or end and return the expression v_return := regexp_replace(v_return, '^[~&\|]+', '',1,1,'i'); v_return := regexp_replace(v_return, '[~&\|]+$', '',1,1,'i'); return v_return; EXCEPTION when OTHERS then return p_input; END translate_search_expression;
/

The following examples demonstrate how this code will translate your users' search expressions into the actual expression.


SQL> select translate_search_expression('hello world') results from dual; RESULTS -------------------------------------------------------------------------------- hello&world
SQL> select translate_search_expression('"hello world"') results from dual; RESULTS -------------------------------------------------------------------------------- {hello world}
SQL> select translate_search_expression('this~(is history) OR demo') results from dual; RESULTS -------------------------------------------------------------------------------- this~(is&history)|demo


To use this in your query, simply use the results of the function as the search expression portion of the CONTAINS clause like so:


SQL> select ident, content from simple_search_ex where contains(content, translate_search_expression('blue or green'), 1);
IDENT CONTENT --------------- ---------------------------------------------- 2 This is the second row, it is blue 4 This is the fourth row, it is green


Now you should be able to fully enable Oracle Text searches in your pl/sql or Apex application while allowing your users to search as they are used-to, while still providing the power of AND/OR/NOT expressions. I hope you have found this useful.