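【Posted】:2021-02-14 04:37:09
【Question】:
The Longest Common Subsequence (LCS) problem is: given two sequences A and B, find the longest subsequence that occurs in both A and B. For example, given A = "peterparker" and B = "spiderman", the longest common subsequence is "pera".
Can someone explain this Longest Common Subsequence algorithm?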
import bisect
import collections
from typing import List


def longestCommonSubsequence(A: List, B: List) -> int:
    # n = len(A)
    # m = len(B)
    indeces_A = collections.defaultdict(list)
    # O(n)
    for i, a in enumerate(A):
        indeces_A[a].append(i)
    # O(n)
    for indeces_a in indeces_A.values():
        indeces_a.reverse()
    # O(m)
    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])
    # The length of indeces_A_filtered is at most n*m, but in practice it's more like O(m) or O(n) as far as I can tell.
    iAs = []
    # O(m log m) in practice as far as I can tell.
    for iA in indeces_A_filtered:
        j = bisect.bisect_left(iAs, iA)
        if j == len(iAs):
            iAs.append(iA)
        else:
            iAs[j] = iA
    return len(iAs)
As written, the algorithm finds the length of the longest common subsequence, but it can be modified to return the longest common subsequence itself.
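One possible way to make that modification is sketched below. It is not taken from the original solution: the function name `lcs_via_lis` and the bookkeeping lists `tail_ids` and `nodes` are assumptions introduced for illustration. The idea is to remember, for every index placed into the sorted list, which entry sat just before it at that moment, and then to walk those links backwards at the end.

```python
import bisect
from collections import defaultdict
from typing import List, Sequence


def lcs_via_lis(A: Sequence, B: Sequence) -> List:
    """Return one longest common subsequence of A and B (hypothetical helper)."""
    # Positions of each element of A, stored in decreasing order.
    positions = defaultdict(list)
    for i in range(len(A) - 1, -1, -1):
        positions[A[i]].append(i)

    tails = []     # tails[k]: smallest A-index ending an increasing run of length k + 1
    tail_ids = []  # tail_ids[k]: id in `nodes` of the entry currently stored in tails[k]
    nodes = []     # nodes[t] = (A-index, id of predecessor node, or -1 if none)

    for b in B:
        for iA in positions.get(b, ()):
            j = bisect.bisect_left(tails, iA)
            prev = tail_ids[j - 1] if j > 0 else -1
            nodes.append((iA, prev))
            if j == len(tails):
                tails.append(iA)
                tail_ids.append(len(nodes) - 1)
            else:
                tails[j] = iA
                tail_ids[j] = len(nodes) - 1

    # Walk the predecessor links back from the end of the longest run.
    result = []
    t = tail_ids[-1] if tail_ids else -1
    while t != -1:
        iA, t = nodes[t]
        result.append(A[iA])
    result.reverse()
    return result


print("".join(lcs_via_lis("peterparker", "spiderman")))  # pera
```

The predecessor links cost extra memory proportional to the number of matching index pairs, so this sketch trades space for the ability to reconstruct the subsequence.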
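I found this algorithm on leetcode (link) while looking for the fastest Python solution. It was the fastest Python solution for the problem (40 ms), and it also appears to run in O(m log m) time, which is much better than the O(m*n) time of most other solutions.
I don't fully understand why it works. I searched for known algorithms for the Longest Common Subsequence problem to find other mentions of it, but could not find anything similar. The closest I could find is the Hunt–Szymanski algorithm (link), which is also said to run in O(m log m) in practice, but it does not seem to be the same algorithm.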
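My understanding:
- indeces_a is reversed so that the smaller index is the one kept in the iAs for loop (this is more obvious when following the walkthrough below).
- As far as I can tell, the iAs for loop finds the longest increasing subsequence of indeces_A_filtered.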
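Both points can be probed with a small experiment. The sketch below isolates the iAs loop as a stand-alone function (the name `lis_length` is an assumption, not part of the original code) and feeds it the index list for A = "aa", B = "a" with and without the reverse step. It suggests that the reversal does more than keep the smaller index: because the loop looks for a strictly increasing run, a decreasing group of indices can contribute at most one element, so a single character of B cannot be matched to two positions of A.

```python
import bisect
from typing import List


def lis_length(seq: List[int]) -> int:
    """Length of the longest strictly increasing subsequence of seq."""
    tails: List[int] = []
    for x in seq:
        j = bisect.bisect_left(tails, x)
        if j == len(tails):
            tails.append(x)
        else:
            tails[j] = x
    return len(tails)


# A = "aa", B = "a": the true LCS length is 1.
reversed_order = [1, 0]  # indices of 'a' in A, reversed as in the question's code
forward_order = [0, 1]   # the same indices without the reverse step

print(lis_length(reversed_order))  # 1
print(lis_length(forward_order))   # 2, which would overcount the LCS
```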
Thanks!
Here is a walkthrough of the algorithm for A = "peterparker" and B = "spiderman":
#    01234567890
A = "peterparker"
B = "spiderman"
indeces_A = {'p':[0,5], 'e':[1,3,9], 't':[2], 'r':[4,7,10], 'a':[6], 'k':[8]}
# after reverse
indeces_A = {'p':[5,0], 'e':[9,3,1], 't':[2], 'r':[10,7,4], 'a':[6], 'k':[8]}
#                     -p-  --e--  ---r--  a
indeces_A_filtered = [5,0, 9,3,1, 10,7,4, 6]
# the `iAs` loop
iA = 5
j = 0
iAs = [5]
iA = 0
j = 0
iAs = [0]
iA = 9
j = 1
iAs = [0,9]
iA = 3
j = 1
iAs = [0,3]
iA = 1
j = 1
iAs = [0,1]
iA = 10
j = 2
iAs = [0,1,10]
iA = 7
j = 2
iAs = [0,1,7]
iA = 4
j = 2
iAs = [0,1,4]
iA = 6
j = 3
iAs = [0,1,4,6] # corresponds to indices of A that spell out "pera", the LCS
return len(iAs) # 4, the length of the LCS
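To cross-check the result of the walkthrough, the length can be compared against the textbook O(n*m) dynamic-programming solution mentioned above. The sketch below is a generic version of that approach with a single rolling row. The name `lcs_length_dp` is an assumption, and this is not the leetcode solution under discussion.

```python
from typing import Sequence


def lcs_length_dp(A: Sequence, B: Sequence) -> int:
    """Textbook O(n*m) dynamic-programming LCS length, using one rolling row."""
    prev = [0] * (len(B) + 1)  # row for the prefix of A processed so far
    for a in A:
        cur = [0]
        for j, b in enumerate(B, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length_dp("peterparker", "spiderman"))  # 4
```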
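【Comments】:
- For two identical strings made up of a single repeated letter, is this O(m log m)?
- @גלעדברקן In that case, A = B = ch*m for some character ch, and indeces_A_filtered will be rev * m, where rev = list(reversed(range(m))). I.e., for m = 4, indeces_A_filtered will equal [3,2,1,0, 3,2,1,0, 3,2,1,0, 3,2,1,0]. So in this case the algorithm is O(m*2 log m). On the final iteration, iAs will equal [0,1,2,3] and the function returns len(iAs), i.e. 4, which is correct.
- In the case where string A has no repeated characters, the overall time complexity is O(l log l), where l = max(n, m).
- What is the m*2 in O(m*2 log m) in your comment above? Is it m times 2 or m to the power of 2?
- Sorry, it should be O(m^2 log m), so m to the power of 2. Also, I found another thread that talks about reducing Longest Common Subsequence to Longest Increasing Subsequence, but it mentions that A cannot have repeated elements: stackoverflow.com/questions/34656050/….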
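The quadratic case raised in these comments can be observed directly by counting how many entries indeces_A_filtered would hold, which equals the number of index pairs (i, j) with A[i] == B[j]. The sketch below is only an illustration, and the helper name `filtered_length` is an assumption. Each entry costs one binary search, so the count multiplied by a log factor gives a rough estimate of the running time.

```python
from collections import defaultdict
from typing import Sequence


def filtered_length(A: Sequence, B: Sequence) -> int:
    """Number of pairs (i, j) with A[i] == B[j], i.e. len(indeces_A_filtered)."""
    counts = defaultdict(int)
    for a in A:
        counts[a] += 1
    return sum(counts.get(b, 0) for b in B)


for m in (4, 8, 16):
    s = "a" * m
    print(m, filtered_length(s, s))  # prints 4 16, then 8 64, then 16 256

print(filtered_length("peterparker", "spiderman"))  # 9
```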
Tags: python algorithm diff subsequence string-algorithm