Anagram of String 2 is Substring of String 1

【Question title】: Anagram of String 2 is Substring of String 1
【Posted】: 2015-11-11 05:33:20
【Question description】:

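How can I find out whether any anagram of String 1 is a substring of String 2?

For example:

String 1 = rove

String 2 = stackoverflow

So it should return true, because "over", an anagram of "rove", is a substring of String 2.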

【Question discussion】:

Tags: string algorithm anagram

【Solution 1】:

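On edit: my first answer was quadratic in the worst case. I have tweaked it to be strictly linear:

Here is an approach based on the notion of a sliding window: create a dictionary keyed by the letters of the first string, with the frequency counts of those letters as the corresponding values. Think of it as a dictionary of targets that need to be matched by m consecutive letters in the second string, where m is the length of the first string.

Start by processing the first m letters of the second string. For each such letter, if it appears as a key in the target dictionary, decrease the corresponding value by 1. The goal is to drive all target values to 0. Define discrepancy as the sum of the absolute values of the target values after processing the first window of m letters.

Repeatedly do the following: check whether discrepancy == 0 and return True if it is. Otherwise, take the character from m letters ago and check whether it is a target key; if so, increase its value by 1. This either increases or decreases the discrepancy by 1, so adjust it accordingly. Then take the next character of the second string and process it as well: check whether it is a key in the dictionary and, if so, adjust the value and the discrepancy as appropriate.

Since there are no nested loops, and each pass through the main loop involves only a few dictionary lookups, comparisons, additions and subtractions, the overall algorithm is linear.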
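
To make the bookkeeping concrete, here is a small sketch that lists the discrepancy for every window of the question's example. It is only an illustration: the name discrepancy_trace is made up here, and it recomputes each window from scratch with collections.Counter instead of updating incrementally, so it is not the linear-time algorithm itself.

```python
from collections import Counter

def discrepancy_trace(s1, s2):
    """Discrepancy of every window of s2 with length len(s1) (hypothetical helper)."""
    m = len(s1)
    target = Counter(s1)
    trace = []
    for start in range(len(s2) - m + 1):
        window = Counter(s2[start:start + m])
        # Only letters that occur in s1 are counted, as in the description above.
        trace.append(sum(abs(target[c] - window[c]) for c in target))
    return trace

print(discrepancy_trace("rove", "stackoverflow"))
# prints [4, 4, 3, 2, 1, 0, 1, 2, 2, 3]
```

The 0 at index 5 corresponds to the window "over"; neighbouring windows differ by at most 2, because one letter leaves and one letter enters.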

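A Python 3 implementation (which shows the basic logic of how the window slides and how the target counts and the discrepancy are adjusted):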

def subAnagram(s1,s2):
    m = len(s1)
    n = len(s2)
    if m > n: return False
    target = dict.fromkeys(s1,0)
    for c in s1: target[c] += 1

    #process initial window
    for i in range(m):
        c = s2[i]
        if c in target:
            target[c] -= 1
    discrepancy = sum(abs(target[c]) for c in target)

    #repeatedly check then slide:
    for i in range(m,n):
        if discrepancy == 0:
            return True
        else:
            #first process letter from m steps ago from s2
            c = s2[i-m]
            if c in target:
                target[c] += 1
                if target[c] > 0: #just made things worse
                    discrepancy +=1
                else:
                    discrepancy -=1
            #now process new letter:
            c = s2[i]
            if c in target:
                target[c] -= 1
                if target[c] < 0: #just made things worse
                    discrepancy += 1
                else:
                    discrepancy -=1
    #if you get to this stage:
    return discrepancy == 0

Typical output:

>>> subAnagram("rove", "stack overflow")
True
>>> subAnagram("rowe", "stack overflow")
False

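To stress-test it, I downloaded the complete text of Moby Dick from Project Gutenberg. It has over a million characters. "Formosa" is mentioned in the book, so an anagram of "moors" appears as a substring of Moby Dick. Not surprisingly, however, no anagram of "stackoverflow" appears in Moby Dick: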

>>> f = open("moby dick.txt")
>>> md = f.read()
>>> f.close()
>>> len(md)
1235186
>>> subAnagram("moors",md)
True
>>> subAnagram("stackoverflow",md)
False

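The last call takes roughly 1 second to process the complete text of Moby Dick and verify that no anagram of "stackoverflow" appears in it.

【Discussion】:

【Solution 2】:

Let L be the length of String1.

Loop over String2 and check whether each substring of length L is an anagram of String1.

In your example, String1 = rove and String2 = stackoverflow.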

stackoverflow

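stac and rove are not anagrams, so move on to the next substring of length L.

stackoverflow

tack and rove are not anagrams, and so on until you find the substring.

A faster method is to check whether the last letter of the current substring is present in String1. That is, once you find that stac and rove are not anagrams and see that 'c' (the last letter of the current substring) is not present in rove, you can skip every substring that contains that 'c' and take the next substring starting from 'k'.

i.e. stackoverflow

stac and rove are not anagrams. 'c' is not present in 'rove', so simply skip past this substring and check from 'k':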

stackoverflow

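This will significantly reduce the number of comparisons.

Edit:

Here is a Python 2 implementation of the method explained above.

Note: this implementation works under the assumption that all characters in both strings are lowercase and consist only of the characters a-z.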

def isAnagram(s1, s2):
    c1 = [0] * 26
    c2 = [0] * 26

    # increase character counts for each string
    for i in s1:
        c1[ord(i) - 97] += 1
    for i in s2:
        c2[ord(i) - 97] += 1

    # if the character counts are same, they are anagrams
    if c1 == c2:
        return True
    return False

def isSubAnagram(s1, s2):
    l = len(s1)

    # s2[start:end] represents the substring in s2
    start = 0
    end = l

    while(end <= len(s2)):
        sub = s2[start:end]
        if isAnagram(s1, sub):
            return True
        elif sub[-1] not in s1:
            start += l
            end += l
        else:
            start += 1
            end += 1
    return False

Output:

>>> print isSubAnagram('rove', 'stackoverflow')
True

>>> print isSubAnagram('rowe', 'stackoverflow')
False
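
The a-z restriction noted above could be lifted by counting with collections.Counter instead of a fixed 26-slot list. The following is a minimal Python 3 sketch of the same skip idea under that assumption; the function name is_sub_anagram_any_chars is hypothetical and not part of the original answer.

```python
from collections import Counter

def is_sub_anagram_any_chars(s1, s2):
    """Same window-and-skip idea, but for arbitrary characters (illustrative sketch)."""
    k = len(s1)
    need = Counter(s1)  # letter counts that a matching window must have
    start = 0
    while start + k <= len(s2):
        sub = s2[start:start + k]
        if Counter(sub) == need:
            return True
        if k > 0 and sub[-1] not in need:
            # No window containing this letter can match, so jump past it.
            start += k
        else:
            start += 1
    return False

print(is_sub_anagram_any_chars("rove", "stackoverflow"))
print(is_sub_anagram_any_chars("Rove!", "xx!evoRxx"))
```

Building a Counter for every window still costs O(k) per check, so this changes only the supported character set, not the complexity.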

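【Discussion】:

What is the complexity of this algorithm?
@JohnColeman I have edited the answer to include an implementation in Python 2. Assuming the shorter string has length k and the longer one has length n, there are at most (n - k + 1) anagram checks. Since each anagram check runs in O(k), I would say the overall algorithm complexity is O(n). However, my implementation is limited by the assumption that all characters are lowercase and only in the range a-z. Yours is a more general implementation.
I don't think the elif at the end of your code is what you want: shouldn't it be start = end + 1 and then end = start + k? In any case, there is an interesting trade-off. Your approach lets you stride across the second string in big jumps (in the best case), but at the cost of more work whenever you actually stop to check for an anagram. I wouldn't be surprised if your method has better average-case performance (perhaps given certain assumptions about letter frequencies) while mine has better worst-case performance.
@JohnColeman You are right, but I am doing the same thing in my elif: I increment both start and end by l (that is a lowercase 'L', which may look like a 1 in the code above, but isn't; l is the length of the shorter string. Please excuse the inconsistency; I called it k in my comment).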

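【Solution 3】:

It can be done with O(n^3) pre-processing and O(k log k) per query, where n is the size of the "given string" (String 2 in your example) and k is the size of the query (String 1 in your example).

Pre-processing: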

For each substring s of string2: //O(n^2) of those
    sort s 
    store s in some data base (hash table, for example)

Query:

given a query q:
    sort q
    check if q is in the data base
    if it is - it's an anagram of some substring
    otherwise - it is not.

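This answer assumes you are going to check multiple "queries" (String 1) against a single string (String 2), and therefore tries to optimize the complexity of each query.

As discussed in the comments, you can do the pre-processing step lazily: the first time you encounter a query of length k, insert all substrings of length k into the DS, and then proceed as in the original suggestion.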
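
The lazy variant might look roughly like this in Python 3. The class name LazyAnagramIndex and its method names are hypothetical, and a dict of sets stands in for the "DS"; treat it as one possible reading of the idea, not as the answer's own code.

```python
class LazyAnagramIndex:
    """Index the sorted substrings of a text, one query length at a time (sketch)."""

    def __init__(self, text):
        self.text = text
        self._by_length = {}  # k -> set of sorted length-k substrings of text

    def _build(self, k):
        text = self.text
        # Sorting a substring gives a canonical form shared by all its anagrams.
        return {"".join(sorted(text[i:i + k])) for i in range(len(text) - k + 1)}

    def contains_anagram_of(self, query):
        k = len(query)
        if k not in self._by_length:
            # First query of this length: pay the pre-processing cost once.
            self._by_length[k] = self._build(k)
        return "".join(sorted(query)) in self._by_length[k]

index = LazyAnagramIndex("stackoverflow")
print(index.contains_anagram_of("rove"))
print(index.contains_anagram_of("rowe"))
```

Both queries above have length 4, so the set of sorted substrings is built once and reused for the second lookup.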

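【Discussion】:

There is no need to pre-compute all substrings. It is better to consider only the substrings of length k and add them to the database if they are not there yet (so this is done only once for each actual word length). This avoids adding to the database all the substrings whose length does not correspond to a real word, and avoids doing all the pre-processing right at the start.
@gen-ys Look at what I said at the end of the answer: I assume one string and multiple queries (of different lengths), and optimize the solution for that, so each query takes minimal time at the cost of more extensive pre-processing.
I understand your answer. My comment is to put only the substrings with the length of the current search word (k1) into the dictionary, so that if another search word of the same length (k1) is given, the dictionary can be reused. If we later get another search word of length k2, we add all substrings of length k2 to the dictionary. The advantage is that you put into the dictionary only substrings of lengths actually used by searches (rather than all possible substrings), and you spread the pre-processing time over multiple searches.
Silence = admission? QED
@gen-y-s Silence = putting the kids to sleep, which is hard (and forgetting to come back to this thread). It is basically "lazy vs. eager" pre-processing. You suggest doing it lazily (which makes sense); the original solution does it eagerly. I have added a mention of this approach to the answer itself.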

【Solution 4】:

You may need to create all possible permutations of String1 (rove), such as rove, rvoe, reov, and so on, and then check whether any of these permutations occurs in String2.
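
For illustration only, this idea can be written in a few lines of Python 3 with itertools.permutations. The function name sub_anagram_by_permutations is hypothetical, and because the number of permutations grows factorially with the length of String1, the sketch is only usable for very short inputs.

```python
from itertools import permutations

def sub_anagram_by_permutations(s1, s2):
    """Try every distinct rearrangement of s1 as a substring of s2 (brute-force sketch)."""
    # set() removes duplicate rearrangements when s1 has repeated letters.
    candidates = {"".join(p) for p in permutations(s1)}
    return any(candidate in s2 for candidate in candidates)

print(sub_anagram_by_permutations("rove", "stackoverflow"))
print(sub_anagram_by_permutations("rowe", "stackoverflow"))
```

For the 4-letter example this checks 24 candidates; a 13-letter query such as "stackoverflow" would already generate more than six billion permutations before deduplication.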

【Discussion】:

That would be sum{k! * (n-k)} strings. Clearly infeasible for strings of any reasonable size.

【Solution 5】:

//Two strings are considered, checking whether an anagram of the second string is
//present in the first string as part of it (substring)
//e.g. 'atctv' 'cat' will return true as 'atc' is an anagram of 'cat'
//Similarly 'battex' contains an anagram of 'text' as 'ttex'

public class SubstringIsAnagramOfSecondString {

    public static boolean isAnagram(String str1, String str2){
        //System.out.println(str1+"::" + str2);
        Character[] charArr = new Character[str1.length()];

        for(int i = 0; i < str1.length(); i++){
            char ithChar1 = str1.charAt(i);
            charArr[i] = ithChar1;
        }
        for(int i = 0; i < str2.length(); i++){
            char ithChar2 = str2.charAt(i);
            for(int j = 0; j<charArr.length; j++){
                if(charArr[j] == null) continue;
                if(charArr[j] == ithChar2){
                    charArr[j] = null;
                    break; //consume only one match per character, so repeated letters are counted
                }
            }
        }
        for(int j = 0; j<charArr.length; j++){
            if(charArr[j] != null)
                return false;
        }
        return true;
    }

    public static boolean isSubStringAnagram(String firstStr, String secondStr){
        int secondLength =  secondStr.length();
        int firstLength =  firstStr.length();
        if(secondLength == 0) return true;
        if(firstLength < secondLength || firstLength == 0) return false;
        //System.out.println("firstLength:"+ firstLength +" secondLength:" + secondLength+ 
                //" firstLength - secondLength:" + (firstLength - secondLength));

        for(int i = 0; i < firstLength - secondLength +1; i++){
            if(isAnagram(firstStr.substring(i, i+secondLength),secondStr )){
                return true;
            }
        }
        return false;

    }
    public static void main(String[] args) {
        System.out.println("isSubStringAnagram(xyteabc,ate): "+ isSubStringAnagram("xyteabc","ate"));

    }

}

【Discussion】: