Given the following, I can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)
[out]:
foo bar
But how do I ensure that the longest common substring respects English word boundaries and doesn't cut up a word? For example, the following sentences:
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
output the following, which is NOT desired since it breaks up the word kappa from s2:
a foo bar
The desired output is still:
foo bar
I've also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way that works on the strings directly, without calculating ngrams? (see answer)
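One possible approach, sketched here rather than offered as the definitive answer: run the same dynamic-programming table over lists of words instead of over characters, so a match can never start or end in the middle of a word. The function name longest_common_word_substring is made up for this illustration, and it assumes that words are simply whitespace-separated tokens and that "longest" means the most words, not the most characters.

```python
def longest_common_word_substring(s1, s2):
    """Longest run of whole words shared by s1 and s2.

    Assumption: a "word" is any whitespace-separated token.
    """
    words1, words2 = s1.split(), s2.split()
    # m[x][y] holds the length, in words, of the common run that ends
    # at words1[x - 1] and words2[y - 1]; cells default to 0 (no match).
    m = [[0] * (1 + len(words2)) for _ in range(1 + len(words1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(words1)):
        for y in range(1, 1 + len(words2)):
            if words1[x - 1] == words2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
    # Re-join with single spaces; the original spacing is not preserved.
    return " ".join(words1[x_longest - longest:x_longest])


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_word_substring(s1, s2))
```

With the two sentences above, the lone "a" in s2 no longer matches the tail of "kappa", because tokens are compared whole. On ties, the first longest run found in s1 is returned.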