How to reduce time complexity in traversing a string? Answers

【Question title】: How to reduce time complexity in traversing a string?
【Posted】: 2021-06-24 15:56:38
【Question】:

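I am solving a problem that asks for the number of index tuples a, b, c, d in a string s of size n, consisting only of lowercase letters, such that:

1 <= a < b < c < d <= n

and

s[a] == s[c] and s[b] == s[d]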

The code I wrote traverses the string character by character in the most basic way:

#include<stdio.h>

int main()
{
    int n, count = 0;
    char s[2002];
    scanf("%d%s", &n, s);
    for(int a = 0; a<n-3; a++)
    {
        for(int b = a + 1; b<n-2; b++)
        {
            for(int c = b + 1; c<n-1; c++)
            {
                for(int d = c + 1; d<n; d++)
                {
                    if(s[a] == s[c] && s[b] == s[d] && a>=0 && b>a && c>b && d>c && d<n)
                    {
                        count++;
                    }
                }
            }
        }
    }
    printf("%d", count);
    return 0;
}

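a, b, c and d are the indices. The problem is that when the input string is large, the 4 nested loops exceed the time limit. Is there any way to improve the code to reduce its complexity?

The problem statement is here: https://www.hackerearth.com/practice/algorithms/searching/linear-search/practice-problems/algorithm/holiday-season-ab957deb/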

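【Comments】:

It would be hard to solve this in O(n). O(n^2), however, is possible.
Although the problem description disguises it (intentionally, I suppose), this is an overlapping intervals problem.
I did manage to solve your problem. The complexity of my code is O(n^2).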

Tags: c string algorithm search time-complexity

【Solution 1】:

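Performing the equality check at an earlier stage saves some time. Also, the check a>=0 && b>a && c>b && d>c && d<n is redundant, because the loop bounds already guarantee it: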

#include<stdio.h>

int main()
{
    int n, count = 0;
    char s[2002];
    scanf("%d%s", &n, s);
    for(int a = 0; a<n-3; a++)
    {
        for(int b = a + 1; b<n-2; b++)
        {
            for(int c = b + 1; c<n-1; c++)
            {
                if(s[a] == s[c]) {
                    for(int d = c + 1; d<n; d++)
                    {
                        if(s[b] == s[d])
                        {
                            count++;
                        }
                    }
                }
            }
        }
    }
    printf("%d", count);
    return 0;
}

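【Comments】:

Thanks, this change did solve more test cases within the time limit, but a few still exceed it, although only a small number. Is there any way to merge any two of the loops?

【Solution 2】: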

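In the worst case the whole string consists of the same character, and then every index tuple with 1 <= a < b < c < d <= N satisfies s[a] == s[c] && s[b] == s[d], so the counter adds up to n*(n-1)*(n-2)*(n-3) / 4!, which is O(n^4). In other words, as long as the counting is done one match at a time (with counter++), there is no way to get the worst-case time complexity below O(n^4).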
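
To get a feel for how large that worst-case count is, here is a minimal sketch. The helper name choose4 is made up for illustration, and n = 2000 is an assumption based on the char s[2002] buffer in the question.

```c
#include <assert.h>

/* Hypothetical helper: number of ways to choose 4 indices out of n,
   i.e. n*(n-1)*(n-2)*(n-3) / 4!, computed in 64-bit arithmetic. */
static long long choose4(long long n)
{
    assert(n >= 0);
    if (n < 4)
        return 0;
    /* The product of four consecutive integers is always divisible by 24. */
    return n * (n - 1) * (n - 2) * (n - 3) / 24;
}
```

With n = 2000 this evaluates to 664668499500, which does not fit in a 32-bit int, so a counter of type long long is the safer choice on platforms where int is 32 bits wide.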

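That said, the algorithm can be improved. One possible and very important improvement is that if s[a] != s[c], there is no point in going on to check all possible indices b and d. user3777427 went in this direction, and it can be taken further like this: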

for(int a = 0; a < n-3; a++)
{
    for(int c = a + 2; c < n-1; c++)
    {
        if(s[a] == s[c])
        {
            for(int b = a + 1; b < c; b++)
            {
                for(int d = c + 1; d < n; d++)
                {
                    if(s[b] == s[d])
                    {
                        count++;
                    }
                }
            }
        }
    }
}

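Edit:

After some more thought, I found a way to bring the worst-case time complexity down to O(n^3) by using histograms.

First, we go over the char array once and fill a histogram, so that index 'a' of the histogram holds the number of occurrences of 'a', index 'b' holds the number of occurrences of 'b', and so on.

Then we use the histogram to eliminate the need for the innermost loop (the d loop), like this: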

int histogram1[256] = {0};
for (int i = 0; i < n; ++i)
{
    ++histogram1[(int) s[i]];
}

int histogram2[256];

for(int a = 0; a < n-3; a++)
{
    --histogram1[(int) s[a]];
    
    for (int i = 'a'; i <= 'z'; ++i)
    {
        histogram2[i] = histogram1[i];
    }

    --histogram2[(int) s[a+1]];

    for (int c = a + 2; c < n-1; c++)
    {
        --histogram2[(int) s[c]];

        for (int b = a + 1; b < c; b++)
        {
            if (s[a] == s[c])
            {
                count += histogram2[(int) s[b]];
            }
        }
    }
}

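【Comments】:

Thanks, the improvement solved more test cases within the time limit, but the time limit problem remains. The suggestion to change the flow of the logic was useful, though.
@rohan843 I edited and added an improved solution.
@rohan843 My earlier improved solution had a bug; I have now edited it and put in code that I tested and verified to work.

【Solution 3】:

Since the string S consists only of lowercase letters, you can maintain a 26x26 table (actually 25x25, ignoring the entries where i=j) that holds the appearances of every possible pair of distinct letters (e.g. ab, ac, bc, and so on).

The following code tracks the completeness of each candidate answer (abab, acac, bcbc, and so on) with two functions: one checks the AC positions and the other checks the BD positions. Once the value reaches 4, the candidate is a valid answer.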

#include <stdio.h>

int digitsAC(int a)
{
    if(a % 2 == 0)
        return a + 1;
    return a;
}

int digitsBD(int b)
{
    if(b % 2 == 1)
        return b + 1;
    return b;
}

int main()
{
    int n, count = 0;
    char s[2002];
    int appearance2x2[26][26] = {0};
    scanf("%d%s", &n, s);
    for(int i = 0; i < n; ++i)
    {
        int id = s[i] - 'a';
        for(int j = 0; j < 26; ++j)
        {
            appearance2x2[id][j] = digitsAC(appearance2x2[id][j]);
            appearance2x2[j][id] = digitsBD(appearance2x2[j][id]);  
        }
    }
    //counting the results
    for(int i = 0; i < 26; ++i)
    {
        for(int j = 0; j < 26; ++j)
        {
            if(i == j)continue;
            if(appearance2x2[i][j] >= 4)count += ((appearance2x2[i][j] - 2) / 2);
        }
    }
    printf("%d", count);
    return 0;
}

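The time complexity is O(26N), which is linear. The code could be sped up further with bitmask operations, but I kept the functions simple for clarity. I have not tested it much, so please tell me if you find a bug!

Edit: there is a problem when handling consecutive occurrences of a letter, as in aabbaabb.

【Comments】:

The idea of using an auxiliary array is good, but I think this code does something different from the original code (and from what the question asks): you count possible letter combinations, whereas the original counts ways of "picking" from the string, so "ababababa" should give more than two hits.
Wow, it looks like I misread the problem. But I think an adjustment to the counting part should do the job. I will edit my answer after running a few test cases.
This program does not work for the input string ababab. The count should be 6, but the output produced is 3.

【Solution 4】:

This problem can be solved if you maintain an array that stores the cumulative frequency of each character in the input string (in a frequency distribution, the sum of a frequency and all the frequencies before it). Since the string contains only lowercase characters, the array size will be [26][N+1].

For example: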

index  - 1 2 3 4 5
string - a b a b a

cumulativeFrequency array:

    0  1  2  3  4  5
a   0  1  1  2  2  3
b   0  0  1  1  2  2

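I built the array by treating the first character of the input string as index 1. Doing so helps us later when solving the problem. For now, just ignore column 0 and assume the string starts at index 1 rather than 0.

Useful facts

Using the cumulative frequency array, we can easily check whether a character is present at any index i: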

if cumulativeFrequency[i]-cumulativeFrequency[i-1] > 0

The number of times a character occurs in the range i to j (excluding both i and j):

frequency between i and j =  cumulativeFrequency[j-1] - cumulativeFrequency[i]
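
The two facts above can be sketched in C roughly as follows. The names cum, build_cumulative, present_at and count_between, as well as the MAX_N bound of 2000, are assumptions made for this illustration and are not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LETTERS 26
#define MAX_N   2000   /* assumed upper bound on the string length */

/* cum[l][i] = occurrences of letter 'a'+l among the first i characters,
   so column 0 is all zeros and positions are 1-based as in the table above. */
static int cum[LETTERS][MAX_N + 1];

/* Fill the table for s, which must hold only 'a'..'z'. */
static void build_cumulative(const char *s)
{
    size_t n = strlen(s);
    assert(n <= MAX_N);
    for (size_t i = 0; i < n; ++i)
        for (int l = 0; l < LETTERS; ++l)
            cum[l][i + 1] = cum[l][i] + (s[i] - 'a' == l);
}

/* 1 if letter ch sits at 1-based position i, else 0. */
static int present_at(char ch, int i)
{
    return cum[ch - 'a'][i] - cum[ch - 'a'][i - 1];
}

/* Occurrences of ch strictly between 1-based positions i and j (i < j). */
static int count_between(char ch, int i, int j)
{
    return cum[ch - 'a'][j - 1] - cum[ch - 'a'][i];
}
```

For the string "ababa", build_cumulative fills the rows for a and b with the same values as the table, and count_between('b', 1, 5) returns 2.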

Algorithm

1: for each character from a-z:
2:     Locate index a and c such that charAt[a] == charAt[c]
3:     for each pair (a, c):
4:         for character from a-z:
5:             b = frequency of character between a and c
6:             d = frequency of character after c
7:             count += b*d

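Time complexity

Lines 1-2:

The outermost loop runs 26 times. We need to find all pairs (a, c), which takes O(n^2) time.

Lines 3-4:

For each pair, we again run a loop 26 times to check how many times each character occurs between a and c and after c.

Lines 5-7:

Using the cumulative frequency array, for each character we can compute in O(1) how many times it occurs between a and c and after c.

So the overall complexity is O(26*n^2*26) = O(n^2).

Code

I wrote the code in Java. I do not have C code. I used plain loops and arrays, so it should be easy to follow.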

//Input N and string 
//Do not pay attention to the next two lines since they are basically taking 
//input using Java input streams
int N = Integer.parseInt(bufferedReader.readLine().trim());
String str = bufferedReader.readLine().trim();

//Construct an array to store cumulative frequency of each character in the string
int[][] cumulativeFrequency = new int[26][N+1];

//Fill the cumulative frequency array
for (int i = 0;i < str.length();i++)
{
    //character at index i
    char ch = str.charAt(i);

    //Fill the cumulative frequency array for each character 
    for (int j = 0;j < 26;j++)
    {
        cumulativeFrequency[j][i+1] += cumulativeFrequency[j][i];
        if (ch-97 == j) cumulativeFrequency[j][i+1]++;
    }
}

int a, b, c, d;
long count = 0;

//Follow the steps of the algorithm here
for (int i = 0;i < 26;i++)
{
    for (int j = 1; j <= N - 2; j++)
    {
        //Check if character at i is present at index j
        a = cumulativeFrequency[i][j] - cumulativeFrequency[i][j - 1];

        if (a > 0)
        {
            //Check if character at i is present at index k
            for (int k = j + 2; k <= N; k++)
            {
                c = cumulativeFrequency[i][k] - cumulativeFrequency[i][k - 1];

                if (c > 0)
                {
                    //For each character, find b*d
                    for (int l = 0; l < 26; l++)
                    {
                        //For each character calculate b and d
                        b = cumulativeFrequency[l][k-1] - cumulativeFrequency[l][j];
                        d = cumulativeFrequency[l][N] - cumulativeFrequency[l][k];

                        count += b * d;
                    }
                }
            }
        }
    }
}

System.out.println(count);

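I hope this helps. The code I provided does not run into the time limit and works for all the test cases. If anything in my explanation is unclear, please leave a comment.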
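
Since the question is tagged C, here is one possible C rendering of the same idea, offered as a sketch rather than a tested submission. The function name count_quadruples is hypothetical; it compares s[a] and s[c] directly instead of looping over the 26 letters to locate the pairs, and it assumes the input holds only 'a' to 'z'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LETTERS 26

/* Hypothetical helper: counts index quadruples a < b < c < d with
   s[a] == s[c] and s[b] == s[d]. Returns -1 if allocation fails. */
static long long count_quadruples(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);

    /* cum[l*(n+1) + i] = occurrences of letter l among the first i chars. */
    int *cum = calloc(LETTERS * (n + 1), sizeof *cum);
    if (cum == NULL)
        return -1;
    for (size_t i = 0; i < n; ++i)
        for (int l = 0; l < LETTERS; ++l)
            cum[l * (n + 1) + i + 1] =
                cum[l * (n + 1) + i] + (s[i] - 'a' == l);

    long long count = 0;
    /* j and k are the 1-based positions of a and c. */
    for (size_t j = 1; j + 2 <= n; ++j)
        for (size_t k = j + 2; k < n; ++k) {
            if (s[j - 1] != s[k - 1])
                continue;
            for (int l = 0; l < LETTERS; ++l) {
                const int *row = cum + l * (n + 1);
                long long b = row[k - 1] - row[j]; /* letter l between a and c */
                long long d = row[n] - row[k];     /* letter l after c */
                count += b * d;
            }
        }
    free(cum);
    return count;
}
```

The work is proportional to 26 * n^2, matching the analysis above, and the 64-bit counter leaves room for the all-equal worst case.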

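【Comments】:

This is a good approach. Given that I wrote an answer that provides no code at all (since the OP should write their own, IMO), I really cannot fault it for providing Java code on a C question.
@JohnBollinger Thanks for the feedback. I did write only simple loops in the Java code. Anyone who really wants to know how the code works will still need pen and paper. Haha.

【Solution 5】:

Problem

It may be useful for thinking about this problem to recognize that it is an exercise in counting overlapping intervals. For example, if we view each pair of identical characters in the input as marking the endpoints of a half-open interval, then the problem asks for the number of pairs of intervals that overlap without one being a subset of the other.

Algorithm

One way to approach the problem is to start by identifying and recording all the intervals. It is straightforward to do this in a way that leaves the intervals grouped by left endpoint and ordered by right endpoint within each group; this falls out easily from a naive scan of the input with a two-level loop nest.

Such an organization of the intervals is convenient both for reducing the search space for overlaps and for counting them more efficiently. In particular, the counting can proceed like this:

For each interval I, consider the groups of intervals whose left endpoints lie strictly between the endpoints of I.
Within each group considered, perform a binary search for an interval whose right endpoint is one greater than the right endpoint of I, or for the position where such an interval would occur.
All members of that group from that point to the end satisfy the overlap criterion, so add that number to the total.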
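
The counting steps above could be sketched in C as follows. This is only one possible reading of the description: the names count_overlaps and upper_bound are hypothetical, each group is stored as a row of an n-by-n array for simplicity (which costs O(n²) memory), and positions are 0-based.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Index of the first element of v[0..len) that is greater than key,
   or len if there is none. v must be sorted in ascending order. */
static size_t upper_bound(const int *v, size_t len, int key)
{
    size_t lo = 0, hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Hypothetical helper: counts quadruples a < b < c < d with
   s[a] == s[c] and s[b] == s[d]. Returns -1 if allocation fails. */
static long long count_overlaps(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    if (n < 4)
        return 0;

    /* Row l of right holds the right endpoints of the intervals that
       start at l, in ascending order; len[l] is the size of that group. */
    int *right = malloc(n * n * sizeof *right);
    size_t *len = calloc(n, sizeof *len);
    if (right == NULL || len == NULL) {
        free(right);
        free(len);
        return -1;
    }
    for (size_t l = 0; l < n; ++l)
        for (size_t r = l + 1; r < n; ++r)
            if (s[l] == s[r])
                right[l * n + len[l]++] = (int)r;

    long long total = 0;
    for (size_t a = 0; a < n; ++a)
        for (size_t k = 0; k < len[a]; ++k) {
            int c = right[a * n + k];            /* interval I = (a, c) */
            for (size_t b = a + 1; b < (size_t)c; ++b) {
                /* Intervals (b, d) with d > c overlap I without nesting. */
                size_t first = upper_bound(&right[b * n], len[b], c);
                total += (long long)(len[b] - first);
            }
        }
    free(right);
    free(len);
    return total;
}
```

For the input ababab this returns 6, the value quoted in the comments on Solution 3.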

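Complexity analysis

The sorted interval list and the group sizes / boundaries can be created at O(n²) cost with a two-level loop nest. There may be as many as n * (n - 1) intervals in total, which happens when all input characters are the same, so the list requires O(n²) storage.

The intervals fall into exactly n - 1 groups, some of which may be empty. For each interval (O(n²) of them), we consider at most n - 2 of those groups and perform a binary search (O(log n)) on each one. This yields O(n³ log n) operations overall.

That is an algorithmic improvement over the O(n⁴) cost of the original algorithm, though it remains to be seen whether the improved asymptotic complexity translates into improved performance for the specific problem sizes being tested.

【Comments】:

You link to a web page that says there is an O(n log n) solution, but you present an O(n^3 log n) algorithm? Also, the web page has an O(n log n) solution because it needs to sort the intervals first. We are given a string in which the intervals are essentially already marked (each character starts and ends a potential interval at the position where it sits), so we do not need to sort. There is an O(n) solution (in practical complexity, counting things like word-size multiplication as a single step).

【Solution 6】:

Here is an O(n) solution (counting the number of characters in the allowed character set as a constant).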

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>


/*  As used in this program, "substring" means a string that can be formed by
    characters from another string.  The resulting characters are not
    necessarily consecutive in the original string.  For example, "ab" is a
    substring of "xaxxxxbxx".

    This program requires the lowercase letters to have consecutive codes, as
    in ASCII.
*/


#define Max   2000      //  Maximum string length supported.
typedef short     T1;   //  A type that can hold Max.
typedef int       T2;   //  A type that can hold Max**2.
typedef long      T3;   //  A type that can hold Max**3.
typedef long long T4;   //  A type that can hold Max**4.
#define PRIT4 "lld"     //  A conversion specification that will print a T4.

#define L   ('z'-'a'+1) //  Number of characters in the set allowed.


/*  A Positions structure records all positions of a character in the string.
    N is the number of appearances, and Position[i] is the position (index into
    the string) of the i-th appearance, in ascending order.
*/
typedef struct { T1 N, Position[Max]; } Positions;


/*  Return the number of substrings "aaaa" that can be formed from "a"
    characters in the positions indicated by A.
*/
static T4 Count1(const Positions *A)
{
    T4 N = A->N;
    return N * (N-1) * (N-2) * (N-3) / (4*3*2*1);
}


/*  Return the number of substrings "abab" that can be formed from "a"
    characters in the positions indicated by A and "b" characters in the
    positions indicated by B.  A and B must be different.
*/
static T4 Count2(const Positions *A, const Positions *B)
{
    //  Exit early for trivial cases.
    if (A->N < 2 || B->N < 2)
        return 0;

    /*  Sum[b] will record the number of "ab" substrings that can be formed
        with a "b" at the position in B->Position[b] or earlier.
    */
    T2 Sum[Max];

    T4 RunningSum = 0;  //  T4, because the final "abab" count can exceed Max**3.

    /*  Iterate b through the indices of B->Position.  While doing this, a is
        synchronized to index to a corresponding place in A->Position.
    */
    for (T1 a = 0, b = 0; b < B->N; ++b)
    {
        /*  Advance a to index into A->Position where A->Position[i]
            first exceeds B->Position[b], or to the end if there is no such
            spot.
        */
        while (a < A->N && A->Position[a] < B->Position[b])
            ++a;

        /*  The number of substrings "ab" that can be formed using the "b" at
            position B->Position[b] is a, the number of "a" preceding it.
            Adding this to RunningSum produces the number of substrings "ab"
            that can be formed using this "b" or an earlier one.
        */
        RunningSum += a;

        //  Record that.
        Sum[b] = RunningSum;
    }

    RunningSum = 0;

    /*  Iterate a through the indices of A->Position.  While doing this, b is
        synchronized to index to a corresponding place in B->Position.
    */
    for (T1 a = 0, b = 0; a < A->N; ++a)
    {
        /*  Advance b to index into B->Position where B->Position[i]
            first exceeds A->Position[a], or to the end if there is no such
            spot.
        */
        while (b < B->N && B->Position[b] < A->Position[a])
            ++b;

        /*  The number of substrings "abab" that can be formed using the "a"
            at A->Position[a] as the second "a" in the substring is the number
            of "ab" substrings that can be formed with a "b" before this "a"
            multiplied by the number of "b" after this "a".

            That number of "ab" substrings is in Sum[b-1], if 0 < b.  If b is
            zero, there are no "b" before this "a", so the number is zero.

            The number of "b" after this "a" is B->N - b.
        */
        if (0 < b) RunningSum += (T3) Sum[b-1] * (B->N - b);
    }

    return RunningSum;
}


int main(void)
{
    //  Get the string length.
    size_t length;
    if (1 != scanf("%zu", &length))
    {
        fprintf(stderr, "Error, expected length in standard input.\n");
        exit(EXIT_FAILURE);
    }

    //  Skip blanks.
    int c;
    do
        c = getchar();
    while (c != EOF && isspace(c));
    ungetc(c, stdin);

    /*  Create an array of Positions, one element for each character in the
        allowed set.
    */
    Positions P[L] = {{0}};

    for (size_t i = 0; i < length; ++i)
    {
        c = getchar();
        if (!islower(c))
        {
            fprintf(stderr,
"Error, malformed input, expected only lowercase letters in the string.\n");
            exit(EXIT_FAILURE);
        }
        c -= 'a';
        P[c].Position[P[c].N++] = i;
    }

    /*  Count the specified substrings.  i and j are iterated through the
        indices of the allowed characters.  For each pair different i and j, we
        count the number of specified substrings that can be formed using
        the character of index i as "a" and the character of index j as "b" as
        described in Count2.  For each pair where i and j are identical, we
        count the number of specified substrings that can be formed using the
        character of index i alone.
    */
    T4 Sum = 0;
    for (size_t i = 0; i < L; ++i)
        for (size_t j = 0; j < L; ++j)
            Sum += i == j
                ? Count1(&P[i])
                : Count2(&P[i], &P[j]);

    printf("%" PRIT4 "\n", Sum);
}

【Comments】: