Source: $ORACLE_HOME/rdbms/admin/utlmatch.sql
Oracle added the UTL_MATCH package in Version 10gRelease 2 to compare strings. The four functions included in the package use different methods to compare a source string and destination string, and return an assessment of what it would take to turn the source into the destination string. The functions are broken down into two categories. The categories are actually the algorithms employed to analyze the strings.
Levenshtein Distance
The Levenshtein Distance (LD) algorithm, commonly called the Edit Distance (ED) algorithm, is the older of the two supported methods. It measures the distance between the source and destination strings. By distance, we’re referring to the number of changes required to turn the source string into the destination string.
Jaro-Winkler
The Jaro-Winkler algorithm is the second category of algorithms used in UTL_ MATCH. These functions take the same two arguments, but instead of simply calculating the number of steps required to change the source string to the destination string, it determines how closely the two strings agree with each other. The algorithm also tries to take into account the possibility of a data entry error when determining similarity.
Procedures/Functions
edit_distance
Returns the number of changes required to turn the source string into the destination string using the Levenshtein Distance algorithm.
function edit_distance returns binary_integer (
s1 in varchar2,
s2 in varchar2
);
edit_distance_similarity
Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match.
function edit_distance_similarity returns binary_integer (
s1 in varchar2,
s2 in varchar2
);
jaro_winkler
Instead of simply calculating the number of steps required to change the source string to the destination string returns similarity based on Jaro-Winkler distance algorithm, determines how closely the two strings agree with each other and tries to take into account the possibility of a data entry error.
function jaro_winkler returns binary_double (
s1 in varchar2,
s2 in varchar2
);
jaro_winkler_similarity
Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match but tries to take into account possible data entry errors.
function jaro_winkler_similarity returns binary_integer (
s1 in varchar2,
s2 in varchar2
);
Example 1:
SQL> SELECT UTL_MATCH.EDIT_DISTANCE('espresso', 'expresso')
2 AS DISTANCE
3 FROM DUAL;
DISTANCE
----------
1
SQL> SELECT UTL_MATCH.EDIT_DISTANCE_SIMILARITY('espresso', 'expresso')
2 AS DISTANCE
3 FROM DUAL;
DISTANCE
----------
88
SQL> SELECT UTL_MATCH.JARO_WINKLER('espresso', 'expresso')
2 AS DISTANCE
3 FROM DUAL;
DISTANCE
----------
9.25E-001
SQL> SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY ('espresso', 'expresso')
2 AS DISTANCE
3 FROM DUAL;
DISTANCE
----------
92
Example 2:
CREATE TABLE TEST
(
test_id number primary key,
test_name varchar2(20),
test_date date
);
INSERT INTO TEST values(1, 'Sony CD Player', '20-FEB-2010');
INSERT INTO TEST values(2, 'Sony CD Player', '24-FEB-2010');
INSERT INTO TEST values(3, 'Pioneer DVD Player', '25-FEB-2010');
INSERT INTO TEST values(4, 'Sony CD Player', '25-FEB-2010');
INSERT INTO TEST values(5, 'Bose Speaker', '22-FEB-2010');
INSERT INTO TEST values(6, 'Tascam CD Burner', '25-FEB-2010');
INSERT INTO TEST values(7, 'Nikon digital camera', '22-FEB-2010');
INSERT INTO TEST values(8, 'Canon digital camera', '26-FEB-2010');
Commit;
Select TEST_ID,TEST_NAME,TEST_DATE
FROM TEST
WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(test_name,'dogotal') > 60;
TEST_ID TEST_NAME TEST_DATE
---------------------- -------------------- -------------------------
7 Nikon digital camera 22-FEB-00
8 Canon digital camera 26-FEB-00
No comments:
Post a Comment