Monday, February 15, 2010

utl_match package for string matching

There are four functions included in the package utl_match use different methods to compare a source string and destination string, and return an assessment of what it would take to turn the source string into the destination string.
Source: $ORACLE_HOME/rdbms/admin/utlmatch.sql

Oracle added the UTL_MATCH package in Version 10gRelease 2 to compare strings. The four functions included in the package use different methods to compare a source string and destination string, and return an assessment of what it would take to turn the source into the destination string. The functions are broken down into two categories. The categories are actually the algorithms employed to analyze the strings.

Levenshtein Distance


The Levenshtein Distance (LD) algorithm, commonly called the Edit Distance (ED) algorithm, is the older of the two supported methods. It measures the distance between the source and destination strings. By distance, we’re referring to the number of changes required to turn the source string into the destination string.

Jaro-Winkler


The Jaro-Winkler algorithm is the second category of algorithms used in UTL_ MATCH. These functions take the same two arguments, but instead of simply calculating the number of steps required to change the source string to the destination string, it determines how closely the two strings agree with each other. The algorithm also tries to take into account the possibility of a data entry error when determining similarity.

Procedures/Functions


edit_distance

Returns the number of changes required to turn the source string into the destination string using the Levenshtein Distance algorithm.
function edit_distance returns binary_integer (
s1    in        varchar2,
s2    in        varchar2       
);

edit_distance_similarity

Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match.
function edit_distance_similarity returns binary_integer (
s1    in        varchar2,
s2    in        varchar2       
);

jaro_winkler

Instead of simply calculating the number of steps required to change the source string to the destination string returns similarity based on Jaro-Winkler distance algorithm, determines how closely the two strings agree with each other and tries to take into account the possibility of a data entry error.
function jaro_winkler returns binary_double (
s1    in       varchar2,
s2    in       varchar2       
);

jaro_winkler_similarity

Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match but tries to take into account possible data entry errors.
function jaro_winkler_similarity returns binary_integer (
s1    in                 varchar2,
s2    in                 varchar2       
);

Example 1:


SQL> SELECT UTL_MATCH.EDIT_DISTANCE('espresso', 'expresso')
  2    AS DISTANCE
  3  FROM DUAL;

  DISTANCE
----------
         1

SQL> SELECT UTL_MATCH.EDIT_DISTANCE_SIMILARITY('espresso', 'expresso')
  2    AS DISTANCE
  3  FROM DUAL;

  DISTANCE
----------
        88

SQL> SELECT UTL_MATCH.JARO_WINKLER('espresso', 'expresso')
  2    AS DISTANCE
  3  FROM DUAL;

  DISTANCE
----------
9.25E-001

SQL> SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY ('espresso', 'expresso')
  2    AS DISTANCE
  3  FROM DUAL;

  DISTANCE
----------
        92

Example 2:


CREATE TABLE TEST

        test_id        number primary key, 
        test_name      varchar2(20), 
        test_date      date
); 

INSERT INTO TEST values(1, 'Sony CD Player', '20-FEB-2010');
INSERT INTO TEST values(2, 'Sony CD Player', '24-FEB-2010');
INSERT INTO TEST values(3, 'Pioneer DVD Player', '25-FEB-2010');
INSERT INTO TEST values(4, 'Sony CD Player', '25-FEB-2010');
INSERT INTO TEST values(5, 'Bose Speaker', '22-FEB-2010');
INSERT INTO TEST values(6, 'Tascam CD Burner', '25-FEB-2010');
INSERT INTO TEST values(7, 'Nikon digital camera', '22-FEB-2010');
INSERT INTO TEST values(8, 'Canon digital camera', '26-FEB-2010');

Commit;

Select TEST_ID,TEST_NAME,TEST_DATE
FROM TEST
WHERE  UTL_MATCH.JARO_WINKLER_SIMILARITY(test_name,'dogotal') > 60;

TEST_ID                TEST_NAME            TEST_DATE
---------------------- -------------------- -------------------------
7                      Nikon digital camera 22-FEB-00
8                      Canon digital camera 26-FEB-00


No comments:

Post a Comment