Search for permutation of characters of a substring in Python (answers)

【Question title】: Search for permutation of characters of a substring in Python
【Posted】: 2014-10-10 08:19:58
【Question】:

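I am trying to extract, from a line of text, every occurrence of a string and of all permutations of its characters.

For example, I need to extract the string t = 'ABC' and all of its permutations ('ABC', 'ACB', 'CAB', 'BCA', 'BAC', 'CBA') from the following string s:

s = 'ABCXABCXXACXXBACXXBCA'

The result is: ABC, ABC, BAC, BCA

The string t can be of any length and can contain any characters from [A-Z], [a-z] and [0-9].

Is there a way to get this result with a regular expression in Python?

I know I could build a list of all the permutations and then search for each item in the list separately, but I would like to know whether a regular expression can give the result in a more compact and faster way.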
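
For reference, the brute-force approach described in the last paragraph might look like the sketch below. The function name `find_permutations_bruteforce` is made up for illustration, and the sketch assumes non-overlapping matches are wanted. It builds every permutation of t, so the pattern grows factorially with the length of t and is only practical for short strings.

```python
import re
from itertools import permutations


def find_permutations_bruteforce(s, t):
    """Return non-overlapping matches of any permutation of t in s, left to right."""
    # A set removes duplicate permutations when t has repeated characters.
    candidates = {"".join(p) for p in permutations(t)}
    # re.escape is not strictly needed for [A-Za-z0-9], but it is harmless.
    pattern = re.compile("|".join(re.escape(c) for c in sorted(candidates)))
    return pattern.findall(s)


print(find_permutations_bruteforce("ABCXABCXXACXXBACXXBCA", "ABC"))
# ['ABC', 'ABC', 'BAC', 'BCA']
```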

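【Question comments】:

I don't think a regular expression can solve this. You will probably need a sliding-window algorithm, which finds the matches in worst case O(n*a), where n is the length of the string and a is the size of the alphabet (a = 26 + 26 + 10 = 62 in your case).
Can the string t contain repeated characters?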

Tags: python regex string permutation

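【Solution 1】:

Let me sketch an algorithm for this problem. It is not a problem that regular expressions are suited to.

The approach maintains a sliding window and compares the character frequencies in the window with the character frequencies of t.

Here is the pseudocode of the algorithm: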

function searchPermutation(inpStr, t):
    // You may want to check t against the regex ^[A-Za-z0-9]+$ here

    // Count how often each character occurs in t
    // For example, t = 'aABBCCC'
    // Then freq = { 'A': 1, 'B': 2, 'C': 3, 'a': 1 }
    freq = frequency(t)

    // Create an empty dict
    window = {}
    // Number of characters in window
    count = 0
    // List of matches
    result = []

    for (i = 0; i < inpStr.length; i++):
        // If the current character is a character in t
        if inpStr[i] in freq:
            // Add the character at the current position (a missing key counts as 0)
            window[inpStr[i]]++

            // If the window already holds t.length characters
            if count == t.length:
                // Remove the character that has just slid out of the left end of the window
                window[inpStr[i - t.length]]--
                // The count stays the same here
            else: // Otherwise, increase the count
                count++

            // If every frequency in window equals the one in freq
            // (entries that dropped to 0 in window are treated as absent)
            if count == t.length and window == freq:
                // Add to the result a match at (i - t.length + 1, i + 1)
                // We can retrieve the string later with substring
                result.append((i - t.length + 1, i + 1))

                // Reset the window and count (prevents overlapping matches)
                // Remove the 2 lines below if you want to include overlapping matches
                window = {}
                count = 0
        else: // If current character not in t
            // Reset the window and count
            window = {}
            count = 0

    return result

This should solve the general problem for any t.
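
A possible Python rendering of the pseudocode above, as a sketch rather than a tested reference implementation. The names `search_permutation` and `allow_overlap` are my own choices, not part of the original answer. The frequency check is spelled out with `all(...)` over the characters of t, so entries that dropped to 0 in the window never affect the comparison.

```python
from collections import Counter


def search_permutation(inp_str, t, allow_overlap=False):
    """Return (start, end) pairs where inp_str[start:end] is a permutation of t."""
    if not t:
        return []
    freq = Counter(t)
    window = Counter()
    count = 0  # number of characters currently in the window
    result = []

    for i, ch in enumerate(inp_str):
        if ch not in freq:
            # A character outside t can never be part of a match: start over.
            window.clear()
            count = 0
            continue

        window[ch] += 1
        if count == len(t):
            # Window is full: drop the character that slid out on the left.
            window[inp_str[i - len(t)]] -= 1
        else:
            count += 1

        # Compare counts only for the characters that occur in t.
        if count == len(t) and all(window[c] == n for c, n in freq.items()):
            result.append((i - len(t) + 1, i + 1))
            if not allow_overlap:
                window.clear()
                count = 0

    return result


s = "ABCXABCXXACXXBACXXBCA"
print([s[a:b] for a, b in search_permutation(s, "ABC")])
# ['ABC', 'ABC', 'BAC', 'BCA']
```

Each step compares at most as many counts as t has distinct characters, which is in line with the O(n*a) bound mentioned in the question comments.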

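【Comments】:

Thanks. After a few experiments with regular expressions, I agree that the best approach is a sliding window.
@cygnusxr1: If there is any mistake in the algorithm above (an off-by-one, for example), feel free to comment. I have not written any actual code to test it, but the idea should be right.
(In retrospect, the window == freq check is expensive. We can keep a map from each character to a list of its indices to determine the last index to jump to, since we check at every step of the loop that the added character does not exceed its limit.)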
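
One way to avoid the repeated window == freq comparison mentioned in the last comment is sketched below. Note that this is a different bookkeeping trick from the index map the comment proposes: it keeps a running count of how many distinct characters of t currently have exactly the required frequency, so each step costs O(1). The names `search_permutation_fast`, `satisfied` and `allow_overlap` are assumptions made for this sketch.

```python
from collections import Counter


def search_permutation_fast(inp_str, t, allow_overlap=False):
    """Return (start, end) pairs where inp_str[start:end] is a permutation of t."""
    k = len(t)
    if k == 0 or k > len(inp_str):
        return []
    need = Counter(t)
    window = Counter()
    satisfied = 0  # distinct characters of t whose window count equals need
    result = []
    last_end = 0   # end of the last reported match, used to skip overlaps

    def adjust(ch, delta):
        nonlocal satisfied
        if ch not in need:
            return  # characters outside t are never counted
        if window[ch] == need[ch]:
            satisfied -= 1
        window[ch] += delta
        if window[ch] == need[ch]:
            satisfied += 1

    for i, ch in enumerate(inp_str):
        adjust(ch, +1)
        if i >= k:
            adjust(inp_str[i - k], -1)  # character leaving the fixed-size window
        # If every character of t has its exact count, the counts sum to k,
        # so the window of k characters contains nothing else.
        if i >= k - 1 and satisfied == len(need):
            start = i - k + 1
            if allow_overlap or start >= last_end:
                result.append((start, i + 1))
                last_end = i + 1

    return result
```

For the string s from the question and t = 'ABC', this should report the same four spans as the pseudocode: (0, 3), (4, 7), (13, 16) and (18, 21).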

【Solution 2】:

Regex solution:

([ABC])(?!\1)([ABC])(?!\1)(?!\2)[ABC]
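
The pattern above is written for t = 'ABC'. A sketch of how the same construction could be generated for another t is shown below. It relies on the assumption that all characters of t are distinct, because the negative lookaheads forbid any repetition of an already captured character. The helper name `permutation_regex` is hypothetical, and the sketch captures the last character as well, which is harmless.

```python
import re


def permutation_regex(t):
    """Build a lookahead/backreference pattern matching any permutation of t."""
    if not t or len(set(t)) != len(t):
        raise ValueError("this construction assumes a non-empty t with distinct characters")
    char_class = "[" + "".join(re.escape(c) for c in t) + "]"
    parts = []
    for i in range(len(t)):
        # Forbid every previously captured character at this position.
        lookaheads = "".join("(?!\\{})".format(j) for j in range(1, i + 1))
        parts.append(lookaheads + "(" + char_class + ")")
    return "".join(parts)


pattern = re.compile(permutation_regex("ABC"))
s = "ABCXABCXXACXXBACXXBCA"
print([m.group(0) for m in pattern.finditer(s)])
# ['ABC', 'ABC', 'BAC', 'BCA']
```

The number of lookaheads grows quadratically with the length of t, so for long strings or for t with repeated characters the sliding-window approach is the safer choice.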

【Comments】: