【Posted】: 2013-12-14 13:49:44
【Question】:
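I am writing a Python script that finds the (longest possible) lengths of all n-word-long substrings shared by two strings, disregarding trailing punctuation. Given two strings:
"this is a sample string"
"this is also a sample string"
I want the script to recognize that these strings share a 2-word sequence ("this is") followed by a 3-word sequence ("a sample string"). Here is my current approach: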
a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
            matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print(matchStreakList)
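This script correctly outputs the (maximal) lengths of the common word-long substrings (2, 3), and it has done so for every test so far. My question is: is there a pair of strings for which the approach above will not work? More to the point: are there existing Python libraries or well-known approaches for finding the maximal lengths of all n-word-long substrings that two strings share?
[This question is different from the longest common substring problem, which is only a special case of what I am looking for (I want to find all common substrings, not just the longest one). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable, but I did not find any working solutions there, and my problem may be slightly easier than the one in the link because I am dealing with words bounded by whitespace.]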
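One well-known way to approach the goal described above is the dynamic-programming table used for the longest common substring problem, applied to word lists instead of characters and keeping every run that cannot be extended rather than only the longest. The following is a minimal sketch under assumptions, not a definitive solution: the function name `common_word_runs` is hypothetical, the inputs are assumed to be already lowercased and stripped of punctuation, and identical runs found at different positions are merged into one entry.

```python
def common_word_runs(first, second):
    """Return every maximal run of consecutive words shared by both strings."""
    a, b = first.split(), second.split()
    runs = set()
    # prev[j + 1] is the length of the common run ending at a[i - 1] and b[j]
    prev = [0] * (len(b) + 1)
    for i in range(len(a)):
        curr = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] == b[j]:
                # extend the run that ended on the previous word of both strings
                curr[j + 1] = prev[j] + 1
                # record the run only if it cannot be extended to the right
                at_end = i + 1 == len(a) or j + 1 == len(b)
                if at_end or a[i + 1] != b[j + 1]:
                    n = curr[j + 1]
                    runs.add(tuple(a[i - n + 1:i + 1]))
        prev = curr
    # shortest runs first, ties broken alphabetically (an arbitrary choice)
    return sorted(runs, key=lambda r: (len(r), r))


runs = common_word_runs("this is a sample string",
                        "this is also a sample string")
print([len(r) for r in runs])  # [2, 3]
```

Because every pair of positions is compared, the sketch takes time proportional to the product of the two word counts, and it reports overlapping runs as separate entries. It also reports every single shared word that cannot be extended, so its output may be longer than a hand-picked list.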
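EDIT:
I am starting a bounty on this question. In case it helps others, I want to clarify a few points. First, the helpful answer suggested below by @DhruvPathak does not find all of the maximally long n-word-long substrings shared by two strings. For example, suppose the two strings we are analyzing are: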
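"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
and
"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
In this case, the list of maximally long n-word-long substrings (disregarding trailing punctuation) is: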
all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
Using the following routine:
#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string, second_string):
    s = difflib.SequenceMatcher(None, first_string, second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

#strip punctuation and lowercase both strings before comparing
a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print(matches(a, b))
one gets the output:
['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
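First, I am not sure how to select from this list only the substrings that consist of whole words. Second, the list does not include "are", which is one of the desired maximally long common n-word-long substrings. Is there a method that finds all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?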
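On the first point, one possible way to get whole words only is to hand `difflib.SequenceMatcher` lists of words rather than raw strings, since it accepts any sequences of hashable items. The sketch below is an illustration under assumptions: the name `word_matches` is hypothetical, and the regular expression simply mirrors the punctuation removed by the `replace` chain above. It does not address the second point, because `get_matching_blocks()` returns a single non-overlapping alignment and can therefore still omit runs such as "are".

```python
import difflib
import re


def word_matches(first_string, second_string):
    """Return the matching blocks of two strings, compared word by word."""
    # drop the same punctuation the routine above removes, then split on whitespace
    first_words = re.sub(r"[,:!';]", "", first_string).lower().split()
    second_words = re.sub(r"[,:!';]", "", second_string).lower().split()
    # autojunk=False turns off the popularity heuristic used for long sequences
    s = difflib.SequenceMatcher(None, first_words, second_words, autojunk=False)
    return [" ".join(first_words[i:i + n])
            for i, j, n in s.get_matching_blocks() if n > 0]


print(word_matches("this is a sample string",
                   "this is also a sample string"))
# ['this is', 'a sample string']
```

Every element of the result is built from whole words, so fragments like 'e' or 'y' cannot appear.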
【Comments】:
-
Do you want the output to be a list of all common substrings? Can they overlap?
-
Take a look at diff-match-patch, a set of Google code for fuzzy string matching; there may be something in there you can use.
-
But "one two three" and "one two two three" have two maximal-length common substrings that overlap.
-
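So are you mostly interested in finding the longest substrings, or is it more complicated than that? I was just thinking that if they cannot overlap, the problem seems to get more complicated, because making one substring longer may make another one shorter, and you would need a scoring system to decide whether it is better to return a substring of length 3, or one of length 2 and another of length 4.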
-
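I am looking for all of the longest common substrings, whether or not they overlap. Sorry for misspeaking above. (I thought overlap would prevent the substrings from being as long as possible, but @RemcoGerlich helped me see that this is not the case.)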
Tags: python string algorithm pattern-matching substring