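One way to optimize the search is to reduce the set of possible candidates for pairs.
The basic idea: if a substring of length n > 2 satisfies the matching constraint (it appears at least K times in the input string), then both of its substrings of length n - 1 must satisfy the constraint as well. Example: if ("abab" , "abab") is a pair of substrings of the input string, then ("aba" , "aba") and ("bab" , "bab") must also be pairs of substrings of the input string.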
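To make this property concrete, here is a small Python check. It is only an illustration: the helper name count_occurrences and the sample string are my own assumptions, not part of the algorithm described here.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count occurrences of sub in text, including overlapping ones."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # continue one position later so overlapping matches are counted too
        start = text.find(sub, start + 1)
    return count


# hypothetical sample input, chosen so that "abab" occurs twice
text = "xababyababz"
whole = count_occurrences(text, "abab")
prefix = count_occurrences(text, "aba")  # "abab" without its last character
suffix = count_occurrences(text, "bab")  # "abab" without its first character

# every occurrence of "abab" contains one occurrence of each shorter string
print(whole, prefix, suffix)
```

The shorter strings can never occur less often than the longer one, which is what makes the pruning safe.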
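This can be used to eliminate candidates for pairs as follows:
Start with an initial set of substrings of length 1, such that the set only contains substrings for which at least K - 1 equal substrings can be found. Extend each of these substrings by appending the next character. Now we can eliminate the substrings for which not enough matches can be found. Repeat this until all substrings are eliminated.
Now from theory to practice: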
This basic data structure simply represents a substring by its start and end point (both inclusive) in the input string.
define substr:
    int start , end
A helper method to get the string represented by a substr:
define getstr:
    input: string s , substr sub
    //sub.end is inclusive
    return string(s , sub.start , sub.end)
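A minimal Python sketch of these two pieces, assuming a frozen dataclass is an acceptable stand-in for the substr structure (the class name Substr is my choice):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substr:
    start: int  # index of the first character
    end: int    # index of the last character (inclusive)


def getstr(s: str, sub: Substr) -> str:
    # end is inclusive, so the slice has to go one past it
    return s[sub.start:sub.end + 1]


print(getstr("abcab", Substr(1, 3)))  # prints "bca"
```

Making the dataclass frozen gives value equality and hashing, which is convenient if substrs are later stored in sets or compared.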
First, generate a lookup table of all characters and their positions in the string. This table will be needed later.
define posMap:
    input: string in
    output: multimap
    multimap pos
    for int i in [0 , length(in)[
        put(pos , in[i] , i) //store the position of character in[i] in the map
    return pos
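In Python the multimap could be approximated with a dict of lists. This is a sketch under that assumption; the name pos_map simply mirrors the pseudocode.

```python
from collections import defaultdict
from typing import Dict, List


def pos_map(text: str) -> Dict[str, List[int]]:
    pos = defaultdict(list)
    for i, ch in enumerate(text):
        # store the position of character ch; positions end up in ascending order
        pos[ch].append(i)
    return dict(pos)


print(pos_map("abcab"))  # {'a': [0, 3], 'b': [1, 4], 'c': [2]}
```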
Another helper method generates the set of indices of all characters that appear only once in the input string:
define listSingle:
    input: multimap pos
    output: set
    set single
    for char c in keys(pos)
        if length(get(pos , c)) == 1
            add(single , get(get(pos , c) , 0))
    return single
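A possible Python version, assuming the lookup table is a dict that maps each character to its list of positions (the literal table below is just sample data for "abcab"):

```python
def list_single(pos):
    # keep the single index of every character that occurs exactly once
    return {indices[0] for indices in pos.values() if len(indices) == 1}


# sample lookup table for the string "abcab"
pos = {"a": [0, 3], "b": [1, 4], "c": [2]}
print(list_single(pos))  # {2}
```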
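A method to create the initial set of matching pairs. These pairs consist of substrings of length 1. The pairs themselves are not stored explicitly; the algorithm only maps the text of a substring to all of its occurrences. (Note: I use "pair" here, although the correct term would be a set of size K.)
define listSinglePairs:
    input: multimap pos , int K
    output: multimap
    multimap result
    for char key in keys(pos)
        list ind = get(pos , key)
        //characters with fewer than K occurrences cannot be part of a match
        if length(ind) < K
            continue
        string k = toString(key)
        for int i in ind
            put(result , k , substr(i , i))
    return result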
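Sketched in Python, with (start, end) tuples standing in for substr and a dict of lists standing in for the multimap (both are assumptions of this sketch). It takes K as a parameter, matching the way the main routine calls listSinglePairs(chars , K).

```python
def list_single_pairs(pos, k):
    result = {}
    for ch, indices in pos.items():
        # a character with fewer than k occurrences cannot be part of a match
        if len(indices) < k:
            continue
        # every occurrence becomes a substring of length 1: start == end
        result[ch] = [(i, i) for i in indices]
    return result


# sample lookup table for the string "abcab"
pos = {"a": [0, 3], "b": [1, 4], "c": [2]}
print(list_single_pairs(pos, 2))
```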
In addition, a method is needed that lists all substrings of the input that contain the same text as a given substring:
define matches:
    input: string in , substr sub , multimap charmap
    output: list
    list result
    string txt = getstr(in , sub)
    //work on a copy so that the lookup table itself is not modified
    list candidates = copy(get(charmap , txt[0]))
    for int i in [1 , length(txt)[
        //increment all elements in candidates
        for int c in [0 , size(candidates)[
            replace(candidates , c , get(candidates , c) + 1)
        list next = get(charmap , txt[i])
        //the candidates were the indices of the previous character in in; after the
        //increment they are equal to the indices of the next character of the
        //substring, if it matches
        candidates = intersection(candidates , next)
        if isEmpty(candidates)
            return EMPTY
    //candidates now holds the indices of the end of all substrings that
    //match the given substring -> convert to list of substr
    for int i in candidates
        //end is inclusive, so the start is length(txt) - 1 positions earlier
        add(result , substr(i - length(txt) + 1 , i))
    return result
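Here is how the shift-and-intersect step might look in Python. This is a sketch, not a definitive implementation: it uses sets for the candidates and (start, end) tuples instead of substr, and it builds the lookup table inline so the example is self-contained.

```python
def matches(text, start, end, charmap):
    sub = text[start:end + 1]  # end is inclusive
    # copy into a set so the lookup table itself is never modified
    candidates = set(charmap.get(sub[0], []))
    for ch in sub[1:]:
        # shift every candidate to the position of the next character ...
        shifted = {c + 1 for c in candidates}
        # ... and keep only those where that character really is ch
        candidates = shifted & set(charmap.get(ch, []))
        if not candidates:
            return []
    # candidates are the end indices; the start is len(sub) - 1 earlier
    return [(c - len(sub) + 1, c) for c in sorted(candidates)]


text = "abcab"
charmap = {}
for i, ch in enumerate(text):
    charmap.setdefault(ch, []).append(i)

print(matches(text, 0, 1, charmap))  # occurrences of "ab": [(0, 1), (3, 4)]
```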
Here is the main routine that does the work:
define listMatches:
    input: string in , int K
    output: multimap
    multimap chars = posMap(in)
    set single = listSingle(chars)
    multimap clvl = listSinglePairs(chars , K)
    multimap result
    while NOT isEmpty(clvl)
        multimap nextlvl
        for string sub in keys(clvl)
            list pairs = get(clvl , sub)
            list tmp
            //extend all substrings by one character
            //substrings that end in a character that only appears once in the
            //input string can be ignored
            for substr s in pairs
                if s.end + 1 >= length(in) OR contains(single , s.end + 1)
                    continue
                add(tmp , substr(s.start , s.end + 1))
            //map all substrs to their respective string
            while NOT isEmpty(tmp)
                substr s = get(tmp , 0)
                string txt = getstr(in , s)
                list match = matches(in , s , chars)
                //match contains s itself, so this removes every handled
                //substr from tmp and the loop terminates
                removeAll(tmp , match)
                //this substring doesn't have enough pairs
                if size(match) < K
                    continue
                //save all matches as solution and candidates for the next round
                for substr m in match
                    put(result , txt , m)
                    put(nextlvl , txt , m)
        //overwrite the candidates for the next round with the new candidates
        clvl = nextlvl
    return result
Note: this algorithm generates a map from every substring for which pairs exist to the positions of that substring.
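For reference, a compact Python sketch of the whole routine. It is not a line-by-line port: instead of calling a matches helper it groups the extended substrings by their text, which should give the same result because every level already holds all occurrences of each surviving string. Dicts of lists replace the multimaps and (start, end) tuples replace substr; those choices are assumptions of this sketch.

```python
from collections import defaultdict


def list_matches(text, k):
    # character -> list of positions (posMap in the pseudocode)
    chars = defaultdict(list)
    for i, ch in enumerate(text):
        chars[ch].append(i)

    # positions of characters that occur exactly once (listSingle)
    single = {idx[0] for idx in chars.values() if len(idx) == 1}

    # first level: characters with at least k occurrences (listSinglePairs)
    level = {ch: [(i, i) for i in idx]
             for ch, idx in chars.items() if len(idx) >= k}

    result = {}
    while level:
        next_level = {}
        for occurrences in level.values():
            # extend every occurrence by one character, grouped by new text
            groups = defaultdict(list)
            for start, end in occurrences:
                # skip extensions past the end of the input or onto a
                # character that occurs only once
                if end + 1 >= len(text) or end + 1 in single:
                    continue
                groups[text[start:end + 2]].append((start, end + 1))
            for txt, occ in groups.items():
                # keep only strings that still occur at least k times
                if len(occ) >= k:
                    result[txt] = occ
                    next_level[txt] = occ
        level = next_level
    return result


print(list_matches("abcab", 2))  # {'ab': [(0, 1), (3, 4)]}
```

Like the pseudocode, this only records substrings of length 2 or more, since the result is filled while extending a level.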
I hope this is understandable (I'm terrible at explaining things).