Fastest way to measure compression ratio of a string in Python3: answers

【Question title】: Fastest way to measure compression ratio of a string in Python3
【Posted】: 2019-02-25 15:57:36
【Question description】:

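I want to estimate the Kolmogorov complexity of short strings (about one word long) by compressing them with LZMA and taking the compression ratio.

What is the most efficient way to do this in Python3?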

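【Question comments】:

Hm, you do realize that compressing a short string will increase the number of bits needed to specify it, since you need to include the decompression algorithm?
From Wikipedia: simply compress the string s with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the length of the resulting string.
@RishavKundu The question is about implementation, not theory.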

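Tags: string python-3.x lzma

【Solution 1】:

Edit:

I am not sure this is a good way to estimate the complexity of short strings, because to compute the Kolmogorov (K-) complexity of a string correctly we have to account for the length of the program used to decompress it. The length of that program (67k for xz 5.1.0 on my Debian laptop) will overwhelm short strings. The following program therefore comes closer to computing an upper bound on K-complexity: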

import lzma #For python 2.7 use backports.lzma

program_length = 67000

def lzma_compression_ratio(test_string):
    bytes_in = bytes(test_string,'utf-8')
    bytes_out = lzma.compress(bytes_in)
    lbi = len(bytes_in)
    lbo = len(bytes_out)+program_length
    ratio = lbo/lbi
    message = '%d bytes compressed to %d bytes, ratio %0.3f'%(lbi,lbo,ratio)
    print(message)
    return ratio

test_string = 'a man, a plan, a canal: panama'
lzma_compression_ratio(test_string)

for n in range(22,25):
    test_string = 'a'*(2**n)
    lzma_compression_ratio(test_string)

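The output below shows a compression ratio above 2000 for the 30-character string, and below 0.01 for the repeated string of length 2^23. These are technically valid upper bounds on K-complexity, but they are clearly not useful for short strings. By comparison, the program "print('a'*30)" is 13 characters long, which gives an upper bound of 13 on the K-complexity of the string of 30 a's, or 0.43 (13/30) as a ratio.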

30 bytes compressed to 67024 bytes, ratio 2234.133
4194304 bytes compressed to 67395 bytes, ratio 0.016
8388608 bytes compressed to 68005 bytes, ratio 0.008
16777216 bytes compressed to 69225 bytes, ratio 0.004
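The 13/30 figure for the short-program bound can be checked directly. The snippet below is only a minimal sketch: it runs the one-line program quoted above, confirms that it prints the string of 30 a's, and compares the two lengths. The variable names are made up for illustration and are not part of the answer.

```python
import contextlib
import io

# Illustrative names (not from the answer): the short program and the string it describes.
program = "print('a'*30)"
target = 'a' * 30

# Run the program and capture its output to confirm it really reproduces the target.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(program)
printed = buffer.getvalue().rstrip('\n')

# Length of the describing program relative to the length of the string.
program_to_string_ratio = len(program) / len(target)
print(printed == target, len(program), len(target), round(program_to_string_ratio, 2))
```

The interpreter needed to run the program is deliberately left out of the count here, just as it is in the 13/30 figure.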

Original answer:

@Superbest, this seems to work, though I don't know whether it is the most efficient:

import lzma

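def lzma_compression_ratio(test_string):
    c = lzma.LZMACompressor()
    bytes_in = bytes(test_string,'utf-8')
    # compress() may buffer data internally; flush() returns the rest and finishes the stream
    bytes_out = c.compress(bytes_in) + c.flush()
    return len(bytes_out)/len(bytes_in)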

test_string = 'a man, a plan, a canal: panama'
compression_ratio = lzma_compression_ratio(test_string)
print(compression_ratio)
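For strings about one word long, the fixed overhead of the .xz container (headers, index, integrity check) may dominate the result. One possible variation, sketched below, asks the standard `lzma` module for a raw LZMA2 stream with no container. The filter chain (`RAW_FILTERS`, LZMA2 at preset 6) and the helper names are assumptions chosen for illustration, not something the answer prescribes.

```python
import lzma

# Assumed filter chain for illustration: LZMA2 at preset 6.
RAW_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

def raw_lzma_ratio(text):
    """Compressed/original size using a raw stream (no .xz container)."""
    data = text.encode('utf-8')
    packed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=RAW_FILTERS)
    return len(packed) / len(data)

def xz_ratio(text):
    """Compressed/original size using the default .xz container."""
    data = text.encode('utf-8')
    return len(lzma.compress(data)) / len(data)

sample = 'a man, a plan, a canal: panama'
print('raw: %0.3f, xz: %0.3f' % (raw_lzma_ratio(sample), xz_ratio(sample)))
```

Both helpers divide by the input length, so an empty string would raise ZeroDivisionError. A raw stream can only be decompressed by passing the same `format` and `filters` back to `lzma.decompress`.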

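【Discussion】:

Why did you wrap len(...) in float?
Also, if I copy the last 3 lines (effectively asking it to compute the same string twice), it gives 0.8 first and then 0.0.
@Superbest, float because I am a Python 2.7 user.
The question is tagged python-3.x.
You forgot to include the size of the LZMA module, which is the program that actually performs the compression/decompression :p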
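One plausible reading of the 0.8-then-0.0 report above (an assumption, since the revision of the code that was run is not shown) is that a single `LZMACompressor` was reused and never flushed. `compress()` may return only part of the stream and keep the rest in an internal buffer until `flush()` is called. A minimal sketch of the difference:

```python
import lzma

data = b'a man, a plan, a canal: panama'

compressor = lzma.LZMACompressor()
# compress() returns whatever output is ready; the rest may still be buffered.
partial = compressor.compress(data)
# flush() emits the buffered data and finishes the stream.
complete = partial + compressor.flush()

print(len(partial), len(complete))
# Only the flushed stream is a complete .xz stream that round-trips.
assert lzma.decompress(complete) == data
```

A compressor object cannot be used again after `flush()`, so each measurement needs a fresh `LZMACompressor` (or a call to `lzma.compress`, which handles this internally).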