【Question Title】: Find all anagrams in a string: O(n) solution
【Posted】: 2017-01-20 10:44:10
【Question】:

Here is the problem:

Given a string s and a non-empty string p, find all the start indices of p's anagrams in s.

Input: s: "cbaebabacd" p: "abc"
Output: [0, 6]
Input: s: "abab" p: "ab"
Output: [0, 1, 2]

Here is my solution:

vector<int> findAnagrams(string s, string p) {
    vector<int> res, s_map(26,0), p_map(26,0);
    int s_len = s.size();
    int p_len = p.size();
    if (s_len < p_len) return res;
    for (int i = 0; i < p_len; i++) {
        ++s_map[s[i] - 'a'];
        ++p_map[p[i] - 'a'];
    }
    if (s_map == p_map)
        res.push_back(0);
    for (int i = p_len; i < s_len; i++) {
        ++s_map[s[i] - 'a'];
        --s_map[s[i - p_len] - 'a'];
        if (s_map == p_map)
            res.push_back(i - p_len + 1);
    }
    return res;
}

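However, I think this is an O(n^2) solution, because I have to compare the vectors s_map and p_map. Does an O(n) solution exist for this problem?

【Comments】:

This is not a very good fit for Stack Overflow. Do you know of an O(n) algorithm? Have you looked for one? If you are looking for general advice, Quora may be a better place. Note that you are unlikely to find an O(n) solution when permutations are involved.
Not sure whether this is any better, but you could first generate all permutations of p and then use something like Aho-Corasick string matching. When you say O(n), what does n refer to (there are two parameters: the length of s and the length of p)?
You might also be interested in this library: combinatorics.codeplex.com

Tags: algorithm data-structures

【Solution 1】:

Let n be the size of p.

Suppose you have an array A of size 26, filled with the number of a's, b's, c's, ... that p contains.

Then create a new array B of size 26, filled with 0.

Call the given (large) string s.

First, initialize B with the counts of a, b, c, ... in the first n characters of s.

Then walk through every word of size n in s, always updating B to fit the current n-sized word.

Whenever B matches A, you have an index where an anagram starts.

To move B from one n-sized word to the next, note that you only need to remove from B the first character of the previous word and add the new character of the next word.

Look at the example: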

Input
s: "cbaebabacd" 
p: "abc"          n = 3 (size of p)

A = {1, 1, 1, 0, 0, 0, ... }  // p contains just 1a, 1b and 1c.

B = {1, 1, 1, 0, 0, 0, ... }  // initially, the first n-sized word contains this.

compare(A,B)

for i = n; i < size of s; i++ {
    B[ s[i-n] ]--;
    B[ s[ i ] ]++;
    compare(A,B)
}

and assume that compare(A,B) prints the index whenever A matches B.

The total complexity is:

first fill of A  = O(size of p)
first fill of B  = O(size of p)   // only the first n characters of s are read
first comparison = O(26)
for-loop = |s| * (2 + O(26)) = |s| * O(28) = O(28|s|) = O(size of s)
____________________________________________________________________
O(size of s) + 2 * O(size of p) + O(26)

which is linear in the size of s (p is no longer than s whenever a match is possible).
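For reference, the pseudocode in this answer could be written in C++ roughly as follows. This is only a sketch: the function name findAnagramStarts is made up for illustration, and it assumes both strings contain only lowercase 'a'..'z', as the size-26 arrays imply.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sliding-window letter counts, as described in this answer.
// Assumption: s and p contain only lowercase 'a'..'z'.
std::vector<std::size_t> findAnagramStarts(const std::string& s,
                                           const std::string& p) {
    std::vector<std::size_t> starts;
    const std::size_t n = p.size();
    assert(n > 0);  // the problem statement says p is non-empty
    if (s.size() < n) return starts;

    std::array<int, 26> A{};  // letter counts of p
    std::array<int, 26> B{};  // letter counts of the current n-sized word of s
    for (std::size_t i = 0; i < n; ++i) {
        ++A[p[i] - 'a'];
        ++B[s[i] - 'a'];
    }
    if (A == B) starts.push_back(0);

    for (std::size_t i = n; i < s.size(); ++i) {
        --B[s[i - n] - 'a'];  // drop the character leaving the window
        ++B[s[i] - 'a'];      // add the character entering the window
        if (A == B) starts.push_back(i - n + 1);  // at most 26 comparisons
    }
    return starts;
}
```

Calling findAnagramStarts("cbaebabacd", "abc") on the first example from the question should give the start indices 0 and 6.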

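【Comments】:

【Solution 2】:

Your solution is already an O(n) solution. The size of the s_map and p_map vectors is a constant (26) that does not depend on n. So the comparison between s_map and p_map takes constant time, no matter how large n is.

Your solution takes about 26 * n integer comparisons to complete, which is O(n).

【Comments】:

For a general alphabet of size m, OP's algorithm is O(n + m), where n is a reasonable measure of the input size. For sufficiently large input strings, m is almost negligible.
@Code-Apprentice I agree. It also uses O(p + m) space, where p is the size of the pattern being searched for.
@Code-Apprentice Actually, I misspoke. If the pattern size is p, the size of the string to search is n, and the alphabet size is m, I think the algorithm takes O(p + (n-p)*m). That can be much more than O(n+m).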
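One way to drop the factor m from each step, sketched here as a variant rather than as the OP's code, is to keep a running count of how many alphabet entries currently differ between the window and the pattern. Each slide then costs O(1), and the whole run costs O(n + m), where the m comes only from zeroing the table. The name findAnagramStartsCounted and the choice of m = 256 (one entry per byte value) are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Variant: track the number of non-zero count differences instead of
// comparing whole count arrays. Assumption: alphabet = all 256 byte values.
std::vector<std::size_t> findAnagramStartsCounted(const std::string& s,
                                                  const std::string& p) {
    std::vector<std::size_t> starts;
    const std::size_t n = p.size();
    assert(n > 0);  // the problem statement says p is non-empty
    if (s.size() < n) return starts;

    // diff[c] = (count of c in the current window) - (count of c in p)
    std::array<long, 256> diff{};
    std::size_t nonzero = 0;  // how many entries of diff are not 0

    // Apply delta (+1 or -1) to one entry and keep nonzero up to date.
    auto add = [&](char c, long delta) {
        long& d = diff[static_cast<unsigned char>(c)];
        if (d == 0) ++nonzero;  // entry is about to leave 0
        d += delta;
        if (d == 0) --nonzero;  // entry has returned to 0
    };

    for (std::size_t i = 0; i < n; ++i) {
        add(p[i], -1);
        add(s[i], +1);
    }
    if (nonzero == 0) starts.push_back(0);

    for (std::size_t i = n; i < s.size(); ++i) {
        add(s[i], +1);      // character entering the window
        add(s[i - n], -1);  // character leaving the window
        if (nonzero == 0) starts.push_back(i - n + 1);  // O(1) check
    }
    return starts;
}
```

The window is an anagram of p exactly when every difference is 0, which is the same as nonzero == 0, so no per-step scan over the alphabet is needed.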

【Solution 3】:

// In papers on string searching algorithms, the alphabet is often
// called Sigma, and it is often not considered a constant. Your
// algorithm works in O(Sigma * n) time, where n is the length of the
// longer string. Below is an algorithm that works in O(n) time even
// when Sigma is too large to make an array of size Sigma, as long as
// values from Sigma are a constant number of "machine words".

// This solution works in O(n) time "with high probability", meaning
// that for all c > 2 the probability that the algorithm takes no more
// than c*n time is 1-o(n^-c). This is a looser bound than O(n)
// worst-case because it uses hash tables, which depend on randomness.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>

using namespace std;

// Finding a needle in a haystack. This works for any iterable type
// whose members can be stored as keys of an unordered_map.
template <typename T>
vector<size_t> AnagramLocations(const T& needle, const T& haystack) {
  // Think of a contiguous region of an ordered container as
  // representing a function f with the domain being the type of item
  // stored in the container and the codomain being the natural
  // numbers. We say that f(x) = n when there are n x's in the
  // contiguous region.
  //
  // Then two contiguous regions are anagrams when they have the same
  // function. We can track how close they are to being anagrams by
  // subtracting one function from the other, pointwise. When that
  // difference is uniformly 0, then the regions are anagrams.
  unordered_map<remove_const_t<remove_reference_t<decltype(*needle.begin())>>,
                intmax_t> difference;
  // As we iterate through the haystack, we track the lead (part
  // closest to the end) and lag (part closest to the beginning) of a
  // contiguous region in the haystack. When we move the region
  // forward by one, one part of the function f is increased by +1 and
  // one part is decreased by -1, so the same is true of difference.
  auto lag = haystack.begin(), lead = haystack.begin();

  // To compare difference to the uniformly-zero function in O(1)
  // time, we make sure it does not contain any points that map to
  // 0. Then the property of being uniformly zero is the same as the
  // property of having an empty difference.
  const auto find = [&](const auto& x) {
    difference[x]++;
    if (0 == difference[x]) difference.erase(x);
  };
  const auto lose = [&](const auto& x) {
    difference[x]--;
    if (0 == difference[x]) difference.erase(x);
  };
  vector<size_t> result;
  // First we initialize the difference with the first needle.size()
  // items from both needle and haystack.
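  for (const auto& x : needle) {
    // A haystack shorter than the needle cannot contain an anagram.
    // Checking before the read also lets a haystack of exactly
    // needle.size() items reach the comparison below.
    if (lead == haystack.end()) return result;
    lose(x);
    find(*lead);
    ++lead;
  }
  size_t i = 0;
  if (difference.empty()) result.push_back(i);
  // The next window starts at index 1 whether or not index 0 matched.
  ++i;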
  // Now we iterate through the haystack with lead, lag, and i (the
  // position of lag) updating difference in O(1) time at each spot.
  for (; lead != haystack.end(); ++lead, ++lag, ++i) {
    find(*lead);
    lose(*lag);
    if (difference.empty()) result.push_back(i);
  }
  return result;
}

int main() {
  string needle, haystack;
  cin >> needle >> haystack;
  const auto result = AnagramLocations(needle, haystack);
  for (auto x : result) cout << x << ' ';
}

【Comments】: