Archive
string distances
See the Jellyfish project: “Jellyfish is a python library for doing approximate and phonetic matching of strings“.
Jellyfish implements the following algorithms: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance.
See the project page for more info.
Levenshtein distance
The Levenshtein distance (or edit distance) between two strings is the minimal number of “edit operations” required to change one string into the other. The two strings can have different lengths. There are three kinds of “edit operations”: deletion, insertion, or alteration of a character in either string.
Example: the Levenshtein distance of “ag-tcc” and “cgctca” is 3.
#!/usr/bin/env python
def LD(s,t):
s = ' ' + s
t = ' ' + t
d = {}
S = len(s)
T = len(t)
for i in range(S):
d[i, 0] = i
for j in range (T):
d[0, j] = j
for j in range(1,T):
for i in range(1,S):
if s[i] == t[j]:
d[i, j] = d[i-1, j-1]
else:
d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1)
return d[S-1, T-1]
a = 'ag-tcc'
b = 'cgctca'
print LD(a, b) # 3
The implementation is from here.
Hamming distance
The Hamming distance is defined between two strings of equal length. It measures the number of positions with mismatching characters.
Example: the Hamming distance between “toned” and “roses” is 3.
#!/usr/bin/env python
def hamming_distance(s1, s2):
assert len(s1) == len(s2)
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
if __name__=="__main__":
a = 'toned'
b = 'roses'
print hamming_distance(a, b) # 3
If you need the number of matching character positions:
#!/usr/bin/env python
def similarity(s1, s2):
assert len(s1) == len(s2)
return sum(ch1 == ch2 for ch1, ch2 in zip(s1, s2))
if __name__=="__main__":
a = 'toned'
b = 'roses'
print similarity(a, b) # 2
Actually this is equal to len(s1) - hamming_distance(s1, s2). Remember, len(s1) == len(s2).
More info on zip() here.
