Find all string permutations of a given string in a given source string

【Question title】: Find all string permutations of given string in given source string
【Posted】: 2017-05-26 02:46:38
【Question】:

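Given a pattern string 'foo' and a source string 'foobaroofzaqofom', we need to find every occurrence of the pattern's letters in the source string, in any letter order. For the given example the result is ['foo', 'oof', 'ofo'].

I have a solution, but I am not sure it is the most efficient one:

Create a hash_map of the pattern's characters, where each character is a key and each value is the number of times that character occurs in the pattern. For the given example it would be {{f: 1}, {o: 2}}
Scan the source string; whenever a character that is in the hash_map is found, try to find all the remaining characters of the pattern right after it
If all characters are found, we have a solution; if not, move on

Here is a C++ implementation: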

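#include <set>
#include <string>
#include <unordered_map>
using namespace std;

set<string> FindSubstringPermutations(const string& s, const string& p)
{
    set<string> result;

    // Guard: the unsigned expression s.size() - p.size() would wrap around otherwise.
    if (p.empty() || p.size() > s.size())
        return result;

    // Count how often each character occurs in the pattern.
    unordered_map<char, int> um;
    for (auto ch : p)
        um[ch] += 1;

    for (size_t i = 0; i + p.size() <= s.size(); ++i)
    {
        auto it = um.find(s[i]);
        if (it != um.end())
        {
            // Work on a copy of the counters for the window starting at i.
            decltype (um) um_c = um;
            um_c[s[i]] -= 1;
            for (size_t t = (i + 1); t < i + p.size(); ++t)
            {
                auto it = um_c.find(s[t]);
                if (it == um_c.end())
                    break;
                else if (it->second == 0)
                    break;
                else
                    it->second -= 1;
            }

            // All counters reach zero only if the window is a permutation of p.
            int sum = 0;
            for (auto c : um_c)
                sum += c.second;

            if (sum == 0)
                result.insert(s.substr(i, p.size()));
        }
    }

    return result;
}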

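The complexity is close to O(n); I don't know how to calculate it more precisely.

So the question is: is there a more efficient solution? Using a hash_map feels a bit hacky, and I think there may be a more efficient solution that uses a plain array and flags for the characters already found.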
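
The plain-array idea mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the original post: the name FindPermutationsWithArray is hypothetical, characters are treated as single bytes, and the window counters are compared in full at every position, which costs a constant 256 comparisons per step.

```cpp
#include <array>
#include <set>
#include <string>

// Hypothetical helper (not from the original post): collects every window of s
// that is a permutation of p, using fixed-size count arrays instead of a hash map.
std::set<std::string> FindPermutationsWithArray(const std::string& s, const std::string& p)
{
    std::set<std::string> result;
    const std::size_t m = p.size();
    if (m == 0 || m > s.size())
        return result;

    // One counter per possible byte value.
    std::array<int, 256> need{};  // character counts of the pattern
    std::array<int, 256> have{};  // character counts of the current window
    for (unsigned char ch : p)
        need[ch] += 1;

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Slide the window: add s[i], drop the character that fell out.
        have[static_cast<unsigned char>(s[i])] += 1;
        if (i >= m)
            have[static_cast<unsigned char>(s[i - m])] -= 1;

        // The window s[i + 1 - m .. i] matches when both histograms are equal.
        if (i + 1 >= m && have == need)
            result.insert(s.substr(i + 1 - m, m));
    }
    return result;
}
```

Called with the strings from the question, FindPermutationsWithArray("foobaroofzaqofom", "foo") collects 'foo', 'oof' and 'ofo'. The counters are updated in place, so nothing is copied per window.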

【Question comments】:

Tags: string algorithm hashmap

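【Solution 1】:

You can optimize this a bit by using an order-invariant hash algorithm that works with a sliding window.

An example of such a hash algorithm could be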

int hash(string s){
    int result = 0;

    for(int i = 0; i < s.length(); i++)
        result += s[i];

    return result;
}

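This algorithm is somewhat oversimplified and rather poor in every respect other than speed (for example in the distribution and the number of possible hash values), but that is not hard to change.

The advantage of such a hash algorithm is that: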

hash("abc") == hash("acb") == hash("bac") == ...

Using a sliding window with this algorithm is quite simple:

string s = "abcd";

hash(s.substr(0, 3)) + 'd' - 'a' == hash(s.substr(1, 3));
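
As a small self-check of the two properties just described, the snippet below evaluates them for a concrete string. It is only a sketch: sumHash is the same additive hash as above, renamed here (an assumption of this example) so that it cannot clash with std::hash, and slidingIdentityHolds is a hypothetical helper that expects a string of at least four characters.

```cpp
#include <string>

// Same additive hash as above, renamed (an assumption of this sketch)
// to avoid clashing with std::hash.
int sumHash(const std::string& s)
{
    int result = 0;
    for (char c : s)
        result += c;
    return result;
}

// Checks that removing s[0] and adding s[3] turns the hash of s[0..2]
// into the hash of s[1..3]. Expects s.size() >= 4.
bool slidingIdentityHolds(const std::string& s)
{
    return sumHash(s.substr(0, 3)) - s[0] + s[3] == sumHash(s.substr(1, 3));
}
```

For "abcd" this compares 294 - 97 + 100 with 98 + 99 + 100; both sides are 297.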

These two properties of this hashing approach allow us to do something like this:

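#include <algorithm>
#include <string>
using std::string;

// Order-invariant hash: the sum of the character values.
int hash(const string& s){
    int result = 0;
    for(char c : s)
        result += c;
    return result;
}

int slideHash(int oldHash, char slideOut, char slideIn){
    return oldHash - slideOut + slideIn;
}

// Exact check, needed because different windows can share a hash value.
// The original leaves this helper undefined; sorting is one possible implementation.
bool isPermutation(string a, string b){
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}

// Returns the index of the first window of s that is a permutation of pattern, or -1.
int findPermuted(const string& s, const string& pattern){
    if(pattern.empty() || pattern.length() > s.length())
        return -1;

    int patternHash = hash(pattern);
    int slidingHash = hash(s.substr(0, pattern.length()));

    if(patternHash == slidingHash && isPermutation(pattern, s.substr(0, pattern.length())))
        return 0;

    for(std::size_t i = 0; i < s.length() - pattern.length(); i++){
        slidingHash = slideHash(slidingHash, s[i], s[i + pattern.length()]);

        if(patternHash == slidingHash)
            if(isPermutation(pattern, s.substr(i + 1, pattern.length())))
                return static_cast<int>(i + 1);
    }

    return -1;
}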

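This is basically a modified version of the Rabin-Karp algorithm that works for permuted strings. The main advantage of this approach is that fewer strings actually have to be compared, which helps quite a bit. That matters especially here, because the comparison (checking whether one string is a permutation of another) is already quite expensive in itself.

Note:
The code above is only meant to demonstrate an idea. It is written to be easy to understand rather than fast, and should not be used directly.

Edit:
The above "implementation" of an order-invariant rolling hash should not be used, because it performs extremely poorly in terms of data distribution. Of course, there are some obvious constraints on this kind of hash: the only thing the hash can be generated from is the actual value of the characters (no indices!), and these values have to be accumulated with a reversible operation.

A better approach would be to map each character to a prime (don't use 2!!!). Since all operations are done modulo 2^(8 * sizeof(hashtype)) (integer overflow), we need to generate a table of multiplicative inverses modulo 2^(8 * sizeof(hashtype)) for all the primes used. I won't cover how to generate these tables, as there are already plenty of resources available on that topic.

The final hash would look like this: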

map<char, int> primes = generatePrimTable();
map<int, int> inverse = generateMultiplicativeInverses(primes);

unsigned int hash(string s){
    unsigned int hash = 1;
    for(int i = 0; i < s.length(); i++)
        hash *= primes[s[i]];

    return hash;
}

unsigned int slideHash(unsigned int oldHash, char slideOut, char slideIn){
    return oldHash * inverse[primes[slideOut]] * primes[slideIn];
}

Keep in mind that this solution works with unsigned integers.
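
The answer does not show how the prime and inverse tables are built, so here is one possible sketch. Everything in it is an assumption made for illustration: the PrimeTable struct, the helper names, the choice of a 32-bit hash type, and the assignment of the odd primes 3, 5, 7, ... to the byte values 0..255. The inverse is computed with Newton's iteration modulo 2^32, which only works for odd numbers; that is also why 2 cannot be used, since an even number has no multiplicative inverse modulo a power of two.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table type: one odd prime and its inverse modulo 2^32 per byte value.
struct PrimeTable
{
    std::uint32_t prime[256];
    std::uint32_t inverse[256];
};

// Trial division is enough for the small primes needed here.
bool isPrime(std::uint32_t n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (std::uint32_t d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

// Multiplicative inverse of an odd a modulo 2^32 via Newton's iteration.
std::uint32_t inverseMod2_32(std::uint32_t a)
{
    std::uint32_t x = a;          // a * a == 1 (mod 8) for odd a: 3 correct bits
    for (int i = 0; i < 5; ++i)
        x *= 2u - a * x;          // each step doubles the number of correct bits
    return x;
}

PrimeTable makePrimeTable()
{
    PrimeTable t{};
    std::uint32_t candidate = 3;  // start at 3 so that 2 is never used
    for (int ch = 0; ch < 256; ++ch)
    {
        while (!isPrime(candidate))
            candidate += 2;
        t.prime[ch] = candidate;
        t.inverse[ch] = inverseMod2_32(candidate);
        candidate += 2;
    }
    return t;
}

// Product of the primes of all characters, modulo 2^32.
std::uint32_t productHash(const PrimeTable& t, const std::string& s)
{
    std::uint32_t h = 1;
    for (unsigned char ch : s)
        h *= t.prime[ch];
    return h;
}

// Removes slideOut from the product and adds slideIn.
std::uint32_t slideProductHash(const PrimeTable& t, std::uint32_t oldHash,
                               unsigned char slideOut, unsigned char slideIn)
{
    return oldHash * t.inverse[slideOut] * t.prime[slideIn];
}
```

With such a table, sliding the hash of "foo" one step to the right in "foobaroofzaqofom" gives the same value as hashing "oob" directly. Equal hashes can still belong to different strings once the product wraps around, so a final exact comparison remains necessary.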

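【Comments】:

Your hash function is definitely wrong: ABC and BBB have the same value, and more generally you can never hash large strings into small integers without collisions. After all, hashing does not always work (at least in the worst case). Finally, resolving collisions with buckets can leave many strings in a single bucket.
And it is not O(n) at all, it is O(nm), where n is the size of the original string and m is the size of the smaller string. This answer is a perfect example of using a good idea in the wrong place. A simple brute-force algorithm is also O(nm).
@SaeedAmiri I have mentioned it several times in the post: the hash algorithm is meant to demonstrate an idea and be easy to understand, not to be used in practice. Arguing that a hash algorithm does not work because there are collisions is a strange claim. As for O(n), I have no idea where you got that claim from, but definitely not from my answer. As for the approach itself, it is a variant of the Rabin-Karp algorithm, which is designed to find a short pattern in a long string, so I don't see why this algorithm shouldn't be used.
@SaeedAmiri As for O(nm): even Boyer-Moore does no better than O(nm) in the worst case. Oh, and last but not least: there is another answer that uses exactly the same approach with a proper hash algorithm.
The point is that you didn't even demonstrate the main idea. Look at your algorithm again: you call 's.substring(i + 1, pattern.length()' O(n) times. That means the algorithm or idea you suggest is O(nm), whereas the whole technique of Rabin-Karp is to avoid checking all such substrings. Otherwise it is unwise to go to so much trouble only to end up with an algorithm that does not work properly and has the same running time as a trivially correct one.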

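【Solution 2】:

A typical rolling hash function for anagrams

Uses the product of primes
This only works for relatively short patterns
The hash values of almost all ordinary words fit into a 64-bit value without overflow.
Based on this anagram matcher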

/* braek; */
/* 'foobaroofzaqofom' */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef unsigned long long HashVal;
static HashVal hashchar (unsigned char ch);
static HashVal hashmem (void *ptr, size_t len);

unsigned char primes26[] =
{ 5,71,79,19,2,83,31,43,11,53,37,23,41,3,13,73,101,17,29,7,59,47,61,97,89,67, };
/*********************************************/
static HashVal hashchar (unsigned char ch)
{
HashVal val=1;

if (ch >= 'A' && ch <= 'Z' ) val = primes26[ ch - 'A'];
else if (ch >= 'a' && ch <= 'z' ) val = primes26[ ch - 'a'];

return val;
}

static HashVal hashmem (void *ptr, size_t len)
{
size_t idx;
unsigned char *str = ptr;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        val *= hashchar ( str[idx] );
        }

return val;
}
/*********************************************/


unsigned char buff [4096];
int main (int argc, char **argv)
{
size_t patlen,len,pos,rotor;
int ch;
HashVal patval;
HashVal rothash=1;

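/* the pattern must be given, be non-empty (patlen is used as a modulus) and fit into buff */
if (argc < 2 || argv[1][0] == 0 || strlen(argv[1]) > sizeof buff) {
        fprintf(stderr, "Usage: %s pattern < text\n", argv[0]);
        return 1;
        }
patlen = strlen(argv[1]);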
patval = hashmem( argv[1], patlen);
// fprintf(stderr, "Pat=%s, len=%zu, Hash=%llx\n", argv[1], patlen, patval);

for (rotor=pos=len =0; ; len++) {
        ch=getc(stdin);
        if (ch == EOF) break;

        if (ch < 'A' || ch > 'z') { pos = 0; rothash = 1; continue; }
        if (ch > 'Z' && ch < 'a') { pos = 0; rothash = 1; continue; }
                /* remove old char from rolling hash */
        if (pos >= patlen) { rothash /= hashchar(buff[rotor]); }
                /* add new char to rolling hash */
        buff[rotor] = ch;
        rothash *= hashchar(buff[rotor]);

        // fprintf(stderr, "%zu: [rot=%zu]pos=%zu, Hash=%llx\n", len, rotor, pos, rothash);

        rotor = (rotor+1) % patlen;
                /* matched enough characters ? */
        if (++pos < patlen) continue;
                /* correct hash value ? */
        if (rothash != patval) continue;
        fprintf(stdout, "Pos=%zu\n", len);
        }

return 0;
}

Output/result:

$ ./a.out foo < anascan.c
Pos=21
Pos=27
Pos=33

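Update: for people who don't like the product of primes, here is a taxi-number style sum-of-cubes implementation (plus an additional histogram check). This should also be 8-bit clean. Note that the cubes are not required; it works just as well with squares, or with just the plain sum. (The final histogram check still needs some work.)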

/* braek; */
/*  'foobaroofzaqofom' */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef unsigned long long HashVal;
static HashVal hashchar (unsigned char ch);
static HashVal hashmem (void *ptr, size_t len);

/*********************************************/
static HashVal hashchar (unsigned char ch)
{
HashVal val=1+ch;

return val*val*val;
}

static HashVal hashmem (void *ptr, size_t len)
{
size_t idx;
unsigned char *str = ptr;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        val += hashchar ( str[idx] );
        }

return val;
}
/*********************************************/
int main (int argc, char **argv)
{
size_t patlen,len,rotor;
int ch;
HashVal patval;
HashVal rothash=1;
unsigned char *patstr;
unsigned pathist[256] = {0};
unsigned rothist[256] = {0};
unsigned char cycbuff[1024];

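/* the pattern must be given, be non-empty (patlen is used as a modulus) and fit into cycbuff */
if (argc < 2 || argv[1][0] == 0 || strlen(argv[1]) > sizeof cycbuff) {
        fprintf(stderr, "Usage: %s pattern < text\n", argv[0]);
        return 1;
        }
patstr = (unsigned char*) argv[1];
patlen = strlen((const char*) patstr);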
patval = hashmem( patstr, patlen);

for(rotor=0; rotor < patlen; rotor++) {
        pathist [ patstr[rotor] ] += 1;
        }
fprintf(stderr, "Pat=%s, len=%zu, Hash=%llx\n", argv[1], patlen, patval);

for (rotor=len =0; ; len++) {
        ch=getc(stdin);
        if (ch == EOF) break;

                /* remove old char from rolling hash */
        if (len >= patlen) {
                rothash -= hashchar(cycbuff[rotor]);
                rothist [ cycbuff[rotor] ] -= 1;
                }
                /* add new char to rolling hash */
        cycbuff[rotor] = ch;
        rothash += hashchar(cycbuff[rotor]);
        rothist [ cycbuff[rotor] ] += 1;

        // fprintf(stderr, "%zu: [rot=%zu], Hash=%llx\n", len, rotor, rothash);

        rotor = (rotor+1) % patlen;
                /* matched enough characters ? (the window is full once len+1 chars were read) */
        if (len + 1 < patlen) continue;
                /* correct hash value ? */
        if (rothash != patval) continue;
                /* correct histogram? */
        if (memcmp(rothist,pathist, sizeof pathist)) continue;
                /* report the start of the window, which ends at index len */
        fprintf(stdout, "Pos=%zu\n", len+1-patlen);
        }

return 0;
}

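【Comments】:

If you provide an algorithm that uses prime multiplication and it only works for small patterns, what is the point of your algorithm? A naive algorithm is much faster both in theory and in practice.
That is what I said: it only works for reasonably sized search terms, such as words in a text. Naive may be faster in practice, but not in theory. Mine is O(N); the naive one is O(N*M), where M is the number of permutations of the search term.
Reasonably sized: when was the last time you used a word of 20 or more characters in a text? Also, mult/div performance is not that bad nowadays.
I think the pattern does not have to be a single word. In fact, it could be a set of keywords. Besides, if you assume small words of around 20 characters, the naive algorithm is already O(n). By the way, your approach, apart from your hash function, is not bad.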