Popular Posts

Saturday, March 19, 2011

Levenshtein distance Or Edit Distance between Two Strings


   The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.
    The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
change a letter,
insert a letter or
delete a letter
The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
d('', '') = 0               -- '' = empty string
d(s, '')  = d('', s) = |s|  -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )
The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.


Java recursive code : 
public class LevenshteinDistance {

public static void main(String[] args) {
String src = "hippo";
String target = "elephant";

System.out.println(getMinimumEditDistnace_Recursion(src, target));

}

private static int getMinimumEditDistnace_Recursion(String src,String target) {
if (src == null || src.length() == 0)
return target != null ? target.length() : 0;

if (target == null || target.length() == 0)
return src != null ? src.length() : 0;

char ch1 = src.charAt(0);
char ch2 = target.charAt(0);

int addOne = 0;
if (ch1 != ch2)
addOne = 1;

String deleteCharFromsrc = "";
if (src.length() != 1)
deleteCharFromsrc = src.substring(1);

String deleteCharFromTarget = "";
if (target.length() != 1)
deleteCharFromTarget = target.substring(1);

return minimum(
getMinimumEditDistnace_Recursion(deleteCharFromsrc,
deleteCharFromTarget)
+ addOne, getMinimumEditDistnace_Recursion(src,
deleteCharFromTarget) + 1, getMinimumEditDistnace_Recursion(
deleteCharFromsrc, target) + 1
);
}

private static int minimum(int a, int b, int c) {
if (a <= b && a <= c) {
return a;
} else {
return (b <= a && b <= c) ? b : c;
}
}
}

Above program creating string literals unnecessarly.We can do with out taking subString by tracking index positions.
public class LevenshteinDistance {

public static void main(String[] args) {
System.out.println(getMinimumEditDistnace_Recursion("", "",0,0));
System.out.println(getMinimumEditDistnace_Recursion("", "a",0,0));
System.out.println(getMinimumEditDistnace_Recursion("aaapppp", "",0,0));
System.out.println(getMinimumEditDistnace_Recursion("frog", "fog",0,0));
System.out.println(getMinimumEditDistnace_Recursion("fly", "ant",0,0));
System.out.println(getMinimumEditDistnace_Recursion("elephant", "happo",0,0));
System.out.println(getMinimumEditDistnace_Recursion("hippo", "elephant",0,0));
System.out.println(getMinimumEditDistnace_Recursion("hippo", "zzzzzzzz",0,0));
System.out.println(getMinimumEditDistnace_Recursion("hello", "hallo",0,0));
}

private static int getMinimumEditDistnace_Recursion(String src,String target,int srcIndex, int targetIndex) {

if (src == null || src.length() == 0 || srcIndex>=src.length())
return target != null ? target.length()- targetIndex : 0;

if (target == null || target.length() == 0 || targetIndex>=target.length())
return src != null ? src.length()- srcIndex : 0;

char ch1 = src.charAt(srcIndex);
char ch2 = target.charAt(targetIndex);

int addOne = 0;
if (ch1 != ch2)
addOne = 1;

return minimum(
getMinimumEditDistnace_Recursion(src,target,srcIndex+1,targetIndex+1)+addOne,
getMinimumEditDistnace_Recursion(src,target,srcIndex,targetIndex+1) + 1,
getMinimumEditDistnace_Recursion(src, target,srcIndex+1,targetIndex) + 1
);
}

private static int minimum(int a, int b, int c) {
if (a <= b && a <= c) {
return a;
} else {
return (b <= a && b <= c) ? b : c;
}
}
}

Algorithm_DP :

1)Set n to be the length of s.
  Set m to be the length of t.
  If n = 0, return m and exit.
  If m = 0, return n and exit.
  Construct a matrix containing 0..m rows and 0..n columns.
2)Initialize the first row to 0..n.
  Initialize the first column to 0..m.
3)Examine each character of s (i from 1 to n).
4)Examine each character of t (j from 1 to m).
5)If s[i] equals t[j], the cost is 0.
  If s[i] doesn't equal t[j], the cost is 1.
6)Set cell d[i,j] of the matrix equal to the minimum of:
  a) The cell immediately above plus 1: d[i-1,j] + 1.
  b) The cell immediately to the left plus 1: d[i,j-1] + 1.
  c) The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7)After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

A two-dimensional matrix, dp[0..|s1|,0..|s2|] is used to hold the edit distance values:

dp[i,j] = dp(s1[1..i], s2[1..j])
dp[0,0] = 0
dp[i,0] = i,  i=1..|s1|
dp[0,j] = j,  j=1..|s2|

dp[i,j] = dpin(dp[i-1,j-1]
             + if src[i]=target[j] then 0 else 1 fi,
             dp[i-1, j] + 1,
             dp[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|


Java Code : 
public class EditDistanceDp {
public static void main(String[] args) {

String src = "hippo";
String target = "elephant";

// Step 1
int srcLength = src.length();
int targetLength = target.length();
if (srcLength == 0) {
System.out.println(targetLength);
return;
}
if (targetLength == 0) {
System.out.println(srcLength);
return;
}
int[][] dp = new int[srcLength + 1][targetLength + 1];

// Step 2
for (int i = 0; i <= srcLength; i++) {
dp[i][0] = i;
}
for (int j = 0; j <= targetLength; j++) {
dp[0][j] = j;
}

// Step 3
int cost;
for (int i = 1; i <= srcLength; i++) {
// Step 4
for (int j = 1; j <= targetLength; j++) {
// Step 5
if (src.charAt(i - 1) == target.charAt(j - 1)) {
cost = 0;
} else {
cost = 1;
}
// Step 6
dp[i][j] = Minimum(dp[i - 1][j] + 1, dp[i][j - 1] + 1,dp[i - 1][j - 1] + cost);
}
}
// Step 7
System.out.println(dp[srcLength][targetLength]);
}

private static int Minimum(int a, int b, int c) {
int temp = a;
if (b < temp) {
temp = b;
}
if (c < temp) {
temp = c;
}
return temp;
}
}

No comments:

Post a Comment