Count ways to choose K equal strings

【Question title】: Count ways to choose K equal strings
【Posted】: 2015-06-06 15:07:01
【Question】:

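We are given a string S consisting of N lowercase English letters. Suppose we have a list L consisting of all non-empty substrings of S.

Now we need to answer Q queries. For the i-th query, I need to count the number of ways to choose exactly K equal strings from the list L.

Note: each query has its own value of K.

To avoid overflow, I need the answer modulo 10^9+7.

Example: let S = ababa and suppose we have 2 queries. The values of K for the queries are: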

2
3

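Then the answer to the first query is 7 and the answer to the second query is 1.

Here the list is L = {"a", "b", "a", "b", "a", "ab", "ba", "ab", "ba", "aba", "bab", "aba", "abab", "baba", "ababa"}

For query 1: there are seven ways to choose two equal strings: ("a", "a"), ("a", "a"), ("a", "a"), ("b", "b"), ("ab", "ab"), ("ba", "ba"), ("aba", "aba").

For query 2: there is one way to choose three equal strings: ("a", "a", "a").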
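To make the counting rule concrete, here is a minimal brute-force sketch in Python. It assumes that a substring occurring c times in L contributes C(c, K) choices, which is consistent with the example above. The function name `count_ways_bruteforce` is made up for illustration, and the approach is only practical for short strings.

```python
from collections import Counter
from math import comb

MOD = 10**9 + 7

def count_ways_bruteforce(s: str, k: int) -> int:
    """Count the ways to pick k equal strings from the list of all substrings of s."""
    # Tally how often each distinct substring occurs in the list L.
    occurrences = Counter(
        s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    )
    # A substring that occurs c times contributes C(c, k) choices.
    return sum(comb(c, k) for c in occurrences.values()) % MOD

print(count_ways_bruteforce("ababa", 2))  # 7
print(count_ways_bruteforce("ababa", 3))  # 1
```

This enumerates all O(N^2) substrings, so it serves as a reference for checking faster solutions rather than as a solution for large N.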

Now the problem is N

【Comments】:

Tags: algorithm

【Solution 1】:

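One way to optimize the search is to reduce the set of possible candidates for pairs.

The basic idea: for every substring of length n > 2 that satisfies the constraint (the substring appears at least K times in the input string), two substrings of length n - 1 that satisfy the constraint must exist as well, namely its prefix and its suffix. Example: if ("abab" , "abab") is a pair of substrings of the input string, then ("aba" , "aba") and ("bab" , "bab") must also be pairs of substrings of the input string.

This can be used to eliminate candidates for pairs in the following way:
Start with an initial set of substrings of length = 1, such that the set only contains substrings for which at least K - 1 other equal substrings can be found (that is, K occurrences in total). Extend each of these substrings by adding the next character. Now we can eliminate the substrings for which not enough matches can be found. Repeat this until all substrings have been eliminated.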
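The level-by-level elimination described above can be sketched compactly in Python. This is a simplified illustration of the idea and not the pseudocode that follows: it groups occurrences by their text in a dictionary instead of intersecting position lists, and the name `frequent_substrings` is an assumption.

```python
from collections import defaultdict

def frequent_substrings(s: str, k: int) -> dict:
    """Map every substring that occurs at least k times to its start indices."""
    # Level 1: group the start positions by single character.
    first = defaultdict(list)
    for i, ch in enumerate(s):
        first[ch].append(i)
    level = {t: starts for t, starts in first.items() if len(starts) >= k}

    result = {}
    length = 1
    while level:
        result.update(level)
        # Extend every surviving occurrence by the next character.
        extended = defaultdict(list)
        for text, starts in level.items():
            for start in starts:
                nxt = start + length  # index of the character to append
                if nxt < len(s):
                    extended[text + s[nxt]].append(start)
        # Eliminate extensions that no longer occur at least k times.
        level = {t: starts for t, starts in extended.items() if len(starts) >= k}
        length += 1
    return result

print(frequent_substrings("ababa", 2))
```

The pruning is safe because every occurrence of a longer substring starts at an occurrence of its prefix, so a substring can only reach K occurrences if its prefix already did.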

Now from theory to practice:

This basic data structure simply represents a substring by its start and end point (inclusive) in the input string.

define substr:
    int start , end

A helper method to get the string represented by a substr:

define getstr:
    input: string s , substr sub

    return string(s , sub.start , sub.end)

First, generate a lookup table of all characters and their positions in the string. This table will be needed later.

define posMap:
    input: string in
    output: multimap

    multimap pos

    for int i in [0 , length(in)[
        put(pos , in[i] , i)//store the position of character in[i] in the map

    return pos

Another helper method generates the set of indices of all characters that appear only once in the input string:

define listSingle:
    input: multimap pos
    output: set

    set single
    for char c in keys(pos)
        if length(get(pos , c)) == 1
            add(single , get(get(pos , c) , 0))

    return single
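For readers who prefer runnable code, the two lookup helpers could look roughly like this in Python. This is a sketch under the assumption that the multimap is modelled as a plain dictionary from character to a list of indices; the snake_case names are illustrative.

```python
from collections import defaultdict

def pos_map(text: str) -> dict:
    """Map each character to the ascending list of indices where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return dict(positions)

def list_single(positions: dict) -> set:
    """Collect the indices of characters that occur exactly once in the input."""
    return {indices[0] for indices in positions.values() if len(indices) == 1}

table = pos_map("abca")
print(table)               # {'a': [0, 3], 'b': [1], 'c': [2]}
print(list_single(table))  # {1, 2}
```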

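A method to create the initial set of matching pairs. These pairs consist of substrings of length 1. The pairs themselves are not spelled out; the algorithm only maps the text of a substring to all of its occurrences. (Note: I use "pair" here, although the correct term would be a group of K equal substrings.)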

define listSinglePairs:
    input: multimap pos , int K
    output: multimap

    multimap result

    for char key in keys(pos)
        list ind = get(pos , key)

        //characters with fewer than K occurrences can't form a match
        if length(ind) < K
            continue

        string k = toString(key)

        for int i in ind
            put(result , k , substr(i , i))

    return result

Additionally, a method is needed that lists all substrings that contain the same string as a given substring:

define matches:
    input: string in , substr sub , multimap charmap
    output: list

    list result

    string txt = getstr(in , sub)

    //copy the list, since it is modified below and charmap must stay intact
    list candidates = copy(get(charmap , txt[0]))

    for int i in [1 , length(txt)[
        //increment all elements in candidates
        for int c in [0 , size(candidates)[
            replace(candidates , c , get(candidates , c) + 1)

        list next = get(charmap , txt[i])

        //since the indices of all candidates were incremented (index of the previous character in
        //in) they now are equal to the indices of the next character in the substring, if it matches
        candidates = intersection(candidates , next)

        if isEmpty(candidates)
            return EMPTY

    //candidates now holds the indices of the end of all substrings that
    //match the given substring -> convert to list of substr
    for int i in candidates
        add(result , substr(i - length(txt) + 1 , i))

    return result
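The same shift-and-intersect idea can be written in Python as follows. This is a minimal sketch that assumes the character table is a dictionary from character to a list of indices and that substrings are passed as inclusive (start, end) index pairs, as in the substr structure above.

```python
def matches(text: str, start: int, end: int, char_map: dict) -> list:
    """Return the (start, end) pairs, end inclusive, of every occurrence of text[start:end + 1]."""
    pattern = text[start:end + 1]
    # Positions where the first character of the pattern occurs.
    candidates = set(char_map.get(pattern[0], ()))
    for ch in pattern[1:]:
        # Shift every candidate one step right, then keep those where ch occurs.
        candidates = {c + 1 for c in candidates} & set(char_map.get(ch, ()))
        if not candidates:
            return []
    # Each candidate is now the index of the last character of one occurrence.
    n = len(pattern)
    return sorted((c - n + 1, c) for c in candidates)

char_map = {"a": [0, 2, 4], "b": [1, 3]}  # position table for "ababa"
print(matches("ababa", 0, 2, char_map))  # [(0, 2), (2, 4)]
```

Building a fresh set at each step avoids modifying the lists stored in the table.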

And this is the main routine that does the work:

define listMatches:
    input: string in , int K
    output: multimap

    multimap chars = posMap(in)
    set single = listSingle(chars)

    multimap clvl = listSinglePairs(chars , K)

    //the matches of length 1 are part of the solution as well
    multimap result = copy(clvl)

    while NOT isEmpty(clvl)
        multimap nextlvl

        for string sub in keys(clvl)
            list pairs = get(clvl , sub)

            list tmp

            //extend all substrings by one character
            //substrings that end in a character that only appears once in the
            //input string can be ignored
            for substr s in pairs
                if s.end + 1 >= length(in) OR contains(single , s.end + 1)
                    continue

                add(tmp , substr(s.start , s.end + 1))

            //map all substrs to their respective string
            while NOT isEmpty(tmp)
                substr s = get(tmp , 0)
                string txt = getstr(in , s)

                list match = matches(in , s , chars)

                //s and all of its matches are handled in this iteration,
                //so remove them from tmp to guarantee the loop terminates
                removeAll(tmp , match)

                //this substring doesn't have enough pairs
                if size(match) < K
                    continue

                //save all matches as solution and candidates for the next round
                for substr m in match
                    put(result , txt , m)
                    put(nextlvl , txt , m)

        //overwrite candidates for the next round with the given candidates
        clvl = nextlvl

    return result

Note: this algorithm generates a map of all substrings for which pairs exist, each mapped to the positions of its occurrences.
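The answer stops at the occurrence map, so turning it into the number asked for in the question is left to the reader. One possible way, sketched here as an assumption and not as part of the answer, is to sum C(c, K) over the occurrence counts c, using factorial tables and Fermat inverses so that each binomial modulo 10^9+7 takes constant time. The names `binomial_table` and `count_ways` are made up for illustration.

```python
MOD = 10**9 + 7

def binomial_table(n: int):
    """Precompute factorials up to n and return a function computing C(c, k) mod MOD."""
    fact = [1] * (n + 1)
    for i in range(1, n + 1):
        fact[i] = fact[i - 1] * i % MOD
    inv_fact = [1] * (n + 1)
    # MOD is prime, so Fermat's little theorem gives the modular inverse.
    inv_fact[n] = pow(fact[n], MOD - 2, MOD)
    for i in range(n, 0, -1):
        inv_fact[i - 1] = inv_fact[i] * i % MOD

    def choose(c: int, k: int) -> int:
        if k < 0 or k > c:
            return 0
        return fact[c] * inv_fact[k] % MOD * inv_fact[c - k] % MOD

    return choose

def count_ways(occurrences: dict, k: int, n: int) -> int:
    """Sum C(count, k) over all substrings; n is the length of the input string."""
    choose = binomial_table(n)
    return sum(choose(len(occ), k) for occ in occurrences.values()) % MOD

# Occurrence map for "ababa" with K = 2 (substring text -> start indices).
occ = {"a": [0, 2, 4], "b": [1, 3], "ab": [0, 2], "ba": [1, 3], "aba": [0, 2]}
print(count_ways(occ, 2, 5))  # 7
```

No substring can occur more than N times, so a table of size N is sufficient, and it could be built once and shared across all Q queries.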

I hope this is understandable (I'm terrible at explaining things).

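【Discussion】:

I think I get what you are trying to do. Can you tell how much time this algorithm takes per query?
@python_slayer I'm really bad at runtime analysis, but I think the worst case is O(n ^ 2) for matches and O(n ^ 4) for listMatches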