Maximize the number of substrings such that no substring has characters from other substrings - Answers

【Question title】: Maximize number of substring such that no substring has characters from other substring
【Posted】: 2020-07-23 12:47:12
【Question】:

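I was recently asked an interesting problem about strings and substrings, and I am still trying to find the optimal answer. I would prefer an answer in Java, although pseudocode or any other language is fine too.

The problem is:

I am given a string S. I have to split it into the maximum number of substrings (not subsequences) such that no substring contains a character that is present in another substring.

Examples: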

1.
   S = "aaaabbbcd"
   Substrings = ["aaaa","bbb","c","d"]

2.
   S = "ababcccdde"
   Substrings = ["abab","ccc","dd","e"]

3.
   S = "aaabbcccddda"
   Substrings = ["aaabbcccddda"]
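
To make the constraint concrete, here is a minimal Java sketch of a checker that tests whether a proposed split is valid, that is, whether no character occurs in more than one piece. The names `SplitChecker` and `isValidSplit` are hypothetical and not part of the question, and the sketch assumes lowercase letters 'a' to 'z' as in the examples. It only checks validity; it does not check that the number of pieces is maximal.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original question.
class SplitChecker {
    // Returns true if no character appears in more than one piece.
    // Assumes every piece contains only lowercase letters 'a'..'z'.
    static boolean isValidSplit(List<String> pieces) {
        int[] owner = new int[26];   // owner[x] = index of the piece that first used letter x
        Arrays.fill(owner, -1);      // -1 means the letter has not been seen yet
        for (int p = 0; p < pieces.size(); p++) {
            for (char ch : pieces.get(p).toCharArray()) {
                int x = ch - 'a';
                if (owner[x] == -1) {
                    owner[x] = p;
                } else if (owner[x] != p) {
                    return false;    // the letter already belongs to a different piece
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The split from example 2: no letter is shared, prints true.
        System.out.println(isValidSplit(Arrays.asList("abab", "ccc", "dd", "e")));
        // Cutting example 3 in two leaves 'a' in both pieces, prints false.
        System.out.println(isValidSplit(Arrays.asList("aaabb", "cccddda")));
    }
}
```

The second call shows why example 3 cannot be split at all: the letter a occurs at both ends of the string.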

I would be happy with any solution better than O(n^2).

Thanks for your help.

【Comments】:

Tags: string algorithm data-structures substring

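【Solution 1】:

This can be done in O(n) time.

The idea is to predict where each substring ends. If we read a char, then the last occurrence of that char must be in the same substring (otherwise the same char would appear in two different substrings).

Take abbacacd as an example. Suppose we know the first and last occurrence of every char in the string.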

01234567
abbacacd   (reading a at index 0)

- we know that our substring must be at least abbaca (last occurrence of a);
- the end of our substring will be the maximum of the last occurrences of
  all the chars inside the substring itself;
- we iterate through the substring:

012345     (we found b at index 1)
abbaca      substring_end = maximum(5, last occurrence of b = 2)
            substring_end = 5.

012345     (we found b at index 2)
abbaca      substring_end = maximum(5, last occurrence of b = 2)
            substring_end = 5.

012345     (we found a at index 3)
abbaca      substring_end = maximum(5, last occurrence of a = 5)
            substring_end = 5.

012345     (we found c at index 4)
abbaca      substring_end = maximum(5, last occurrence of c = 6)
            substring_end = 6.

0123456    (we found a at index 5)
abbacac     substring_end = maximum(6, last occurrence of a = 5)
            substring_end = 6.

0123456    (we found c at index 6)
abbacac     substring_end = maximum(6, last occurrence of c = 6)
            substring_end = 6. 

---END OF FIRST SUBSTRING---

01234567
abbacacd           [reading d]

- the first and last occurrences of d are at the same index.
- d is an atomic substring.
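
The trace above can be sketched in Java as follows. This is only an illustration of the extension step, assuming the string contains lowercase letters 'a' to 'z' only; `SegmentEnd`, `lastOccurrences` and `segmentEnd` are hypothetical names that do not appear in the answer.

```java
// Hypothetical illustration of the trace; assumes lowercase letters 'a'..'z'.
class SegmentEnd {
    // last[x] = index of the last occurrence of letter x in s.
    // Entries for letters that never occur stay 0 and are never read.
    static int[] lastOccurrences(String s) {
        int[] last = new int[26];
        for (int i = 0; i < s.length(); i++) {
            last[s.charAt(i) - 'a'] = i;
        }
        return last;
    }

    // Returns the index at which the substring starting at 'start' must end.
    static int segmentEnd(String s, int start) {
        int[] last = lastOccurrences(s);
        int end = last[s.charAt(start) - 'a'];
        // Every char inside the substring may push the end further to the right.
        for (int j = start + 1; j < end; j++) {
            end = Math.max(end, last[s.charAt(j) - 'a']);
        }
        return end;
    }

    public static void main(String[] args) {
        String s = "abbacacd";
        System.out.println(segmentEnd(s, 0)); // prints 6: the first substring is "abbacac"
        System.out.println(segmentEnd(s, 7)); // prints 7: "d" stands alone
    }
}
```

Recomputing the table on every call keeps the sketch short; a full solution would build it once and reuse it for every substring.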

The O(n) solution is:

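#include <algorithm>
#include <cstring>
#include <iostream>
#include <string>

using namespace std;

int main(){
    // pos[c][0] = first index of letter c, pos[c][1] = last index of letter c.
    // -1 means the letter has not been seen. Assumes lowercase 'a'..'z' only.
    int pos[26][2];
    int index;
    memset(pos, -1, sizeof(pos));
    string s = "aaabbcccddda";
    int n = s.size();

    // First pass: record the first and last occurrence of every letter.
    for(int i = 0; i < n; i++){
        index = s[i] - 'a';
        if(pos[index][0] == -1) pos[index][0] = i;
        pos[index][1] = i;
    }

    // Second pass: i always points at the start of a new substring.
    int substr_end;
    for(int i = 0; i < n; i++){
        index = s[i] - 'a';
        // A letter that occurs exactly once is a substring on its own.
        if(pos[index][0] == pos[index][1]) cout<<s[i]<<endl;
        else{
            substr_end = pos[index][1];
            // Extend the end while chars inside the substring occur further right.
            for(int j = i + 1; j < substr_end; j++){
                substr_end = max(substr_end, pos[s[j] - 'a'][1]);
            }
            cout<<s.substr(i, substr_end - i + 1)<<endl;
            // Skip past the substring, so every index is visited a constant number of times.
            i = substr_end;
        }
    }
}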

【Comments】:

Thanks. Works perfectly. In my head I kept getting stuck on the inner loop. It never occurred to me that i=substr_end would make this O(n).

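【Solution 2】:

You can do it in two passes. In the first, you determine the maximum index of each character in the string. In the second, you keep track of the maximum index over all characters encountered so far. If that maximum equals the current index, you have reached the end of a unique substring.

Here is some Java code to illustrate: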

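public class Main {
    public static void main(String[] args) {
        String s = "aaaabbbcd";
        char[] c = s.toCharArray();

        // First pass: max[x] = last index at which letter x occurs.
        int[] max = new int[26];
        for(int i=0; i<c.length; i++) max[c[i]-'a'] = i;

        // Second pass: m = furthest last index seen so far, lm = start of the current substring.
        for(int i=0, m=0, lm=0; i<c.length;)
          if((m = Math.max(m, max[c[i]-'a'])) == i++)
            System.out.format("%s ", s.substring(lm, lm = i));
    }
}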

Output:

aaaa bbb c d

For the other 2 strings:

abab ccc dd e 
aaabbcccddda
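
For reuse or testing, the same two-pass idea could be wrapped in a method that returns the pieces instead of printing them. The following is a minimal sketch under the same lowercase-only assumption; `TwoPassSplit` and `split` are hypothetical names, not taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the two-pass idea; assumes lowercase 'a'..'z'.
class TwoPassSplit {
    static List<String> split(String s) {
        // First pass: last[x] = last index at which letter x occurs.
        int[] last = new int[26];
        for (int i = 0; i < s.length(); i++) {
            last[s.charAt(i) - 'a'] = i;
        }

        // Second pass: close a piece whenever the running maximum equals i.
        List<String> pieces = new ArrayList<>();
        int start = 0;
        int end = 0;
        for (int i = 0; i < s.length(); i++) {
            end = Math.max(end, last[s.charAt(i) - 'a']);
            if (end == i) {
                pieces.add(s.substring(start, i + 1));
                start = i + 1;
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("aaaabbbcd"));    // [aaaa, bbb, c, d]
        System.out.println(split("ababcccdde"));   // [abab, ccc, dd, e]
        System.out.println(split("aaabbcccddda")); // [aaabbcccddda]
    }
}
```

Returning a list makes it easy to compare the result against the expected splits from the question.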

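【Comments】:

【Solution 3】:

The accepted answer contains some unnecessary complexity in its implementation of the algorithm. Splitting a string (such as the examples the OP posted in the question) into the maximum number of substrings such that no substring has a character present in another substring is quite simple.

Algorithm:
(Assumption: the input is a non-empty string containing 1 or more characters in the range 'a' to 'z')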

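1. Record the last position of each character of the input string.
2. Assume the end position of the first substring is 0.
3. Iterate through the string and, for every character in the input string:
   a) If the last position of the current character is greater than the substring end position, update the substring end position to the last position of the current character.
   b) Add (or print) the character currently being processed as part of the current substring.
   c) If the substring end position equals the position of the character currently being processed, this is the end of a unique substring, and a new substring starts from the next character.
4. Repeat step 3 until the end of the input string.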

Implementation:

#include <stdio.h>
#include <string.h>

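void unique_substr(const char * pst) {
    // ch_last_pos[x] = last index of letter x in pst (input assumed to be 'a'..'z' only)
    size_t ch_last_pos[26] = {0};
    size_t subst_end_pos = 0;
    size_t len = strlen(pst);

    printf ("%s -> ", pst);
    // Pass 1: record the last position of every character
    for (size_t i = 0; i < len; i++) {
        ch_last_pos[pst[i] - 'a'] = i;
    }

    // Pass 2: extend the current substring end and print the characters
    for (size_t i = 0; i < len; i++) {
        size_t pos = ch_last_pos[pst[i] - 'a'];
        if (pos > subst_end_pos) {
            subst_end_pos = pos;
        }

        printf ("%c", pst[i]);

        // End of the current substring reached: print a separator
        if (subst_end_pos == i) {
            printf (" ");
        }
    }
    printf ("\n");
}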

//Driver program

int main(void) {

    //base cases
    unique_substr ("b");
    unique_substr ("ab");

    //strings posted by OP in question
    unique_substr ("aaaabbbcd");
    unique_substr ("ababcccdde");
    unique_substr ("aaabbcccddda");

    return 0;
}

Output:

# ./a.out
b -> b 
ab -> a b 
aaaabbbcd -> aaaa bbb c d 
ababcccdde -> abab ccc dd e 
aaabbcccddda -> aaabbcccddda
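
The implementation above indexes a 26-entry array and therefore relies on the stated 'a' to 'z' assumption. If the input may contain other characters, one possible variation keeps the same algorithm but stores the last positions in a map. This is a sketch of an extension that is not part of the answer; `AnyCharSplit` and `split` are hypothetical names, and the code works on Java `char` values (UTF-16 code units).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variation: same algorithm, but without the 'a'..'z' assumption.
class AnyCharSplit {
    static List<String> split(String s) {
        // Record the last position of every character that occurs in s.
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            last.put(s.charAt(i), i);
        }

        List<String> pieces = new ArrayList<>();
        int start = 0;
        int end = 0;
        for (int i = 0; i < s.length(); i++) {
            // Every character of s is in the map, so get() never returns null here.
            end = Math.max(end, last.get(s.charAt(i)));
            if (end == i) {
                pieces.add(s.substring(start, i + 1));
                start = i + 1;
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("A1A2b")); // [A1A, 2, b]
    }
}
```

The map costs some hashing overhead compared with the array, but the running time stays linear in the length of the string on average.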

【Comments】: