【Question Title】: Algorithm for simple string compression
【Posted】: 2026-02-18 03:40:01
【Question】:

I want to find the shortest encoding for a string of the following form:

abbcccc = a2b4c

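【Comments】:

  • Isn't this "start with a single a; keep going with twice as many repetitions of the next character; stop at c"? The "only information needed" for strings of up to 2**26 characters is the "stop character", apart from the decompressor/expander (Kolmogorov complexity).

Tags: algorithm encoding compression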


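【Solution 1】:

[Note: this greedy algorithm does not guarantee the shortest solution]

By remembering all previous occurrences of each character, it is straightforward to find the first occurrence of a repeating substring (the one whose repetitions end at the smallest index, i.e. the one that leaves the longest remaining string after all repetitions) and replace it with an RLE (Python 3 code):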

def singleRLE_v1(s):
    occ = dict() # for each character remember all previous indices of occurrences
    for idx,c in enumerate(s):
        if not c in occ: occ[c] = []
        for c_occ in occ[c]:
            s_c = s[c_occ:idx]
            i = 1
            while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
                i += 1
            if i > 1:
                rle_pars = ('(',')') if len(s_c) > 1 else ('','')
                rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
                s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
                return s_RLE
        occ[c].append(idx)

    return s # no repeating substring found

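To make this robust for iterated application, we have to exclude a few cases where RLE must not be applied (e.g. '11' or '))'), and we also have to make sure that the RLE does not make the string longer (which can happen with a two-character substring that occurs twice, as in 'abab'):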

def singleRLE(s):
    "find first occurrence of a repeating substring and replace it with RLE"
    occ = dict() # for each character remember all previous indices of occurrences
    for idx,c in enumerate(s):
        if idx>0 and s[idx-1] in '0123456789': continue # no RLE for e.g. '11' or other parts of previous inserted RLE
        if c == ')': continue # no RLE for '))...)'

        if not c in occ: occ[c] = []
        for c_occ in occ[c]:
            s_c = s[c_occ:idx]
            i = 1
            while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
                i += 1
            if i > 1:
                print("found %d*'%s'" % (i,s_c))
                rle_pars = ('(',')') if len(s_c) > 1 else ('','')
                rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
                if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
                    s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
                    return s_RLE
        occ[c].append(idx)

    return s # no repeating substring found
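
The length check in singleRLE can be looked at in isolation. The following is a minimal sketch, not part of the original answer; the helper names rle_token and rle_pays_off are made up for illustration. It builds the token the same way as the code above and compares its length with the text it replaces:

```python
def rle_token(unit, count):
    """Build the RLE token in the answer's format (hypothetical helper)."""
    if len(unit) > 1:
        return '%d(%s)' % (count, unit)  # multi-character units need parentheses
    return '%d%s' % (count, unit)        # single characters do not

def rle_pays_off(unit, count):
    """True if the token is no longer than the repeated text (ties favour RLE)."""
    return len(rle_token(unit, count)) <= count * len(unit)

print(rle_token('ab', 2), rle_pays_off('ab', 2))    # 5 characters replace 4
print(rle_token('def', 2), rle_pays_off('def', 2))  # 6 characters replace 6 (tie)
```

For 'abab' the token '2(ab)' is one character longer than the original, which is why that replacement is skipped.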

Now we can safely call singleRLE on the previous output for as long as a repeating string is found:

def iterativeRLE(s):
    s_RLE = singleRLE(s)
    while s != s_RLE:
        print(s_RLE)
        s, s_RLE = s_RLE, singleRLE(s_RLE)
    return s_RLE

With the print statements inserted above, we get, for example, the following trace and result:

>>> iterativeRLE('xyabcdefdefabcdefdef')
found 2*'def'
xyabc2(def)abcdefdef
found 2*'def'
xyabc2(def)abc2(def)
found 2*'abc2(def)'
xy2(abc2(def))
'xy2(abc2(def))'

But this greedy algorithm fails for the following input:

>>> iterativeRLE('abaaabaaabaa')
found 3*'a'
ab3abaaabaa
found 3*'a'
ab3ab3abaa
found 2*'b3a'
a2(b3a)baa
found 2*'a'
a2(b3a)b2a
'a2(b3a)b2a'

whereas one of the shortest solutions is 3(ab2a) (7 characters instead of 10).
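
To check that two encodings really describe the same string, a small decoder helps. This is a sketch that is not part of the original answer; the function name expand is made up, and it assumes well-formed input whose plain text contains no digits or parentheses:

```python
def expand(encoded):
    """Expand an encoding such as '3(ab2a)' back into the plain string.

    Assumption: the input is well formed and the original text contains
    neither digits nor parentheses.
    """
    def parse(pos):
        out = []
        while pos < len(encoded) and encoded[pos] != ')':
            start = pos
            while pos < len(encoded) and encoded[pos].isdigit():
                pos += 1
            count = int(encoded[start:pos]) if pos > start else 1
            if encoded[pos] == '(':
                unit, pos = parse(pos + 1)  # nested group
                pos += 1                    # skip the closing ')'
            else:
                unit = encoded[pos]         # single character
                pos += 1
            out.append(unit * count)
        return ''.join(out), pos

    return parse(0)[0]

print(expand('a2(b3a)b2a') == expand('3(ab2a)') == 'abaaabaaabaa')
```

Both the greedy result and the shorter encoding expand to the same input string.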

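【Comments】:

  • If you have a string, say tctttttttttttcttttttttttctttttttttttttct, then this code will return tc11tc10tc11tct, of length 15. However, there is a better encoding, tc11tc2(t9tct), of length 14.
  • @q85ts Correct. See my notes in this answer and in my other answer.
【Solution 2】:

Since the greedy algorithm does not work, some search is needed. Here is a search with some pruning (if the first idx0 characters of the string were not touched in a branch, do not try to find repeating substrings within those characters; and if a substring occurs several times in a row, always replace all consecutive occurrences). The function name says depth-first, but because candidates are taken with pop(0), the list acts as a FIFO queue and the traversal is effectively breadth-first; all candidates are explored either way: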

def isRLE(s):
    "is this a well nested RLE? (only well nested RLEs can be further nested)"
    nestCnt = 0
    for c in s:
        if c == '(':
            nestCnt += 1
        elif c == ')':
            if nestCnt == 0:
                return False
            nestCnt -= 1
    return nestCnt == 0

def singleRLE_gen(s,idx0=0):
    "find all occurrences of a repeating substring with first repetition not ending before index idx0 and replace each with RLE"
    print("looking for repeated substrings in '%s', first rep. not ending before index %d" % (s,idx0))
    occ = dict() # for each character remember all previous indices of occurrences
    for idx,c in enumerate(s):
        if idx>0 and s[idx-1] in '0123456789': continue # sub-RLE cannot start after number

        if not c in occ: occ[c] = []
        for c_occ in occ[c]:
            s_c = s[c_occ:idx]
            if not isRLE(s_c): continue # avoid RLEs for e.g. '))...)'
            if idx+len(s_c) < idx0: continue # pruning: this substring has been tried before
            if c_occ-len(s_c) >= 0 and s[c_occ-len(s_c):c_occ] == s_c: continue # pruning: always take all repetitions
            i = 1
            while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
                i += 1
            if i > 1:
                rle_pars = ('(',')') if len(s_c) > 1 else ('','')
                rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
                if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
                    s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
                    #print("  replacing %d*'%s' -> %s" % (i,s_c,s_RLE))
                    yield s_RLE,c_occ
        occ[c].append(idx)

def iterativeRLE_depthFirstSearch(s):
    shortestRLE = s
    candidatesRLE = [(s,0)]
    while len(candidatesRLE) > 0:
        candidateRLE,idx0 = candidatesRLE.pop(0)
        for rle,idx in singleRLE_gen(candidateRLE,idx0):
            if len(rle) <= len(shortestRLE):
                shortestRLE = rle
                print("new optimum: '%s'" % shortestRLE)
            candidatesRLE.append((rle,idx))
    return shortestRLE

Sample output:

>>> iterativeRLE_depthFirstSearch('tctttttttttttcttttttttttctttttttttttct')
looking for repeated substrings in 'tctttttttttttcttttttttttctttttttttttct', first rep. not ending before index 0
new optimum: 'tc11tcttttttttttctttttttttttct'
new optimum: '2(tctttttttttt)ctttttttttttct'
new optimum: 'tctttttttttttc2(ttttttttttct)'
looking for repeated substrings in 'tc11tcttttttttttctttttttttttct', first rep. not ending before index 2
new optimum: 'tc11tc10tctttttttttttct'
new optimum: 'tc11t2(ctttttttttt)tct'
new optimum: 'tc11tc2(ttttttttttct)'
looking for repeated substrings in 'tc5(tt)tcttttttttttctttttttttttct', first rep. not ending before index 2
...
new optimum: '2(tctttttttttt)c11tct'
...
new optimum: 'tc11tc10tc11tct'
...
new optimum: 'tc11t2(c10t)tct'
looking for repeated substrings in 'tc11tc2(ttttttttttct)', first rep. not ending before index 6
new optimum: 'tc11tc2(10tct)'
...    
new optimum: '2(tc10t)c11tct'
...    
'2(tc10t)c11tct'
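
As an alternative to exploring rewrites of the string, the shortest encoding can also be computed with dynamic programming over substrings. This is a different technique from the search above, sketched here under assumptions: the name shortest_rle is made up, the input is assumed to contain no digits or parentheses, and the running time is polynomial but not tuned.

```python
from functools import lru_cache

def shortest_rle(s):
    """Return one shortest encoding of s in the count-before-unit format.

    Assumption: s contains no digits or parentheses.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        sub = s[i:j]
        n = j - i
        result = sub                       # option 1: leave the text as is
        for k in range(i + 1, j):          # option 2: split into two parts
            cand = best(i, k) + best(k, j)
            if len(cand) < len(result):
                result = cand
        for p in range(1, n // 2 + 1):     # option 3: sub is a unit repeated n/p times
            if n % p == 0 and sub == sub[:p] * (n // p):
                inner = best(i, i + p)
                unit = inner if len(inner) == 1 else '(' + inner + ')'
                cand = str(n // p) + unit
                if len(cand) <= len(result):  # on a tie prefer RLE
                    result = cand
        return result

    return best(0, len(s))

print(len(shortest_rle('abaaabaaabaa')))
```

For 'abaaabaaabaa' this finds a 7-character encoding. Ties between equally short encodings are broken by the order in which options are tried, so the result may differ from other shortest encodings, for example leaving 'bb' as is instead of writing '2b'.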

【Comments】:

    【Solution 3】:

    Below is my C++ implementation, which works in place with O(n) time complexity and O(1) space complexity.

    #include <string>
    #include <vector>
    using namespace std;

    class Solution {
    public:
        int compress(vector<char>& chars) {
            int n = (int)chars.size();
            if(chars.empty()) return 0;
            int left = 0, right = 0, currCharIndx = left;
            while(right < n) {
                if(chars[currCharIndx] != chars[right]) {
                    int len = right - currCharIndx;
                    chars[left++] = chars[currCharIndx];
                    if(len > 1) {
                        string freq = to_string(len);
                        for(int i = 0; i < (int)freq.length(); i++) {
                            chars[left++] = freq[i];
                        }
                    }
                    currCharIndx = right;
                }
                right++;
            }
            int len = right - currCharIndx;
            chars[left++] = chars[currCharIndx];
            if(len > 1) {
                string freq = to_string(len);
                for(int i = 0; i < (int)freq.length(); i++) {
                    chars[left++] = freq[i];
                }
            }
            return left;
        }
    };
    

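    You need to keep track of three pointers: right iterates over the input, currCharIndx tracks the first position of the current character's run, and left tracks the write position in the compressed string.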
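
The same three-pointer idea can be sketched in Python on a list of characters. This is an illustrative port, not the author's code; the names compress_in_place, write, run_start and read are made up and correspond to left, currCharIndx and right. Like the C++ version, it writes the character before its count and only handles runs of a single character, so it does not produce the nested format discussed in the other answers.

```python
def compress_in_place(chars):
    """Compress runs in a list of single-character strings in place.

    Returns the length of the compressed prefix of chars.
    """
    n = len(chars)
    if n == 0:
        return 0
    write = 0      # next write position ('left' in the C++ code)
    run_start = 0  # first index of the current run ('currCharIndx')
    for read in range(1, n + 1):  # 'right'; read == n closes the last run
        if read == n or chars[read] != chars[run_start]:
            run_len = read - run_start
            chars[write] = chars[run_start]
            write += 1
            if run_len > 1:
                for digit in str(run_len):  # the count may have several digits
                    chars[write] = digit
                    write += 1
            run_start = read
    return write

buf = list('abbcccc')
k = compress_in_place(buf)
print(''.join(buf[:k]))  # ab2c4
```

Writing in place is safe because a run of length L produces at most L output characters, so the write position never overtakes the read position.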

    【Comments】: