【Title】: Why do Python and wc disagree on byte count?
【Posted】: 2016-08-26 14:49:06
【Description】:

Python and wc disagree wildly on the byte count (length) of a given string:

with open("commedia.pfc", "w") as f:
    t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))
    print(len(t))
    f.write(t)

Output: 318885

$> wc commedia.pfc
  2181  12282 461491 commedia.pfc

The file consists mostly of unreadable characters, so I will provide a hexdump instead:

http://www.filedropper.com/dump_2

The file is the result of a prefix-free compression; if you ask, I can provide the full code that generated it, along with the input text.

Why do the two byte counts differ?


I have added the full code of the compression algorithm. It looks long, but it is full of documentation and tests, so it should be easy to understand:

"""
Implementation of prefix-free compression and decompression.
"""
import doctest
from itertools import islice
from collections import Counter
import random
import json

def binary_strings(s):
    """
    Given an initial list of binary strings `s`,
    yield all binary strings ending in one of `s` strings.

    >>> take(9, binary_strings(["010", "111"]))
    ['010', '111', '0010', '1010', '0111', '1111', '00010', '10010', '01010']
    """
    yield from s
    while True:
        s = [b + x for x in s for b in "01"]
        yield from s

def take(n, iterable):
    """
    Return first n items of the iterable as a list.
    """
    return list(islice(iterable, n))

def chunks(xs, n):
    """
    Yield successive n-sized chunks from xs.
    The final chunk may be shorter than n.
    """
    for i in range(0, len(xs), n):
        yield xs[i:i + n]

def reverse_dict(dictionary):
    """
    >>> sorted(reverse_dict({1:"a",2:"b"}).items())
    [('a', 1), ('b', 2)]
    """
    return {value : key for key, value in dictionary.items()}

def prefix_free(generator):
    """
    Given a `generator`, yield all the items from it
    that do not start with any preceding element.

    >>> take(6, prefix_free(binary_strings(["00", "01"])))
    ['00', '01', '100', '101', '1100', '1101']
    """
    seen = []
    for x in generator:
        if not any(x.startswith(i) for i in seen):
            yield x
            seen.append(x)

def build_translation_dict(text, starting_binary_codes=["000", "100","111"]):
    """
    Builds a dict for `prefix_free_compression` where
       More common char -> More short binary strings
    This is compression as the shorter binary strings will be seen more times than
    the long ones.

    Univocity in decoding is given by the binary_strings being prefix free.

    >>> sorted(build_translation_dict("aaaaa bbbb ccc dd e", ["01", "11"]).items())
    [(' ', '001'), ('a', '01'), ('b', '11'), ('c', '101'), ('d', '0001'), ('e', '1001')]
    """
    binaries = sorted(list(take(len(set(text)), prefix_free(binary_strings(starting_binary_codes)))), key=len)
    frequencies = Counter(text)
    # char value as a tiebreaker to avoid non-determinism
    alphabet = sorted(list(set(text)), key=(lambda ch: (frequencies[ch], ch)), reverse=True)
    return dict(zip(alphabet, binaries))

def prefix_free_compression(text, starting_binary_codes=["000", "100","111"]):
    """
    Implements `prefix_free_compression`, simply uses the dict
    made with `build_translation_dict`.

    Returns a tuple (compressed_message, translation_dict), as the dict is needed
    for decompression.

    >>> prefix_free_compression("aaaaa bbbb ccc dd e", ["01", "11"])[0]
    '010101010100111111111001101101101001000100010011001'
    """
    translate = build_translation_dict(text, starting_binary_codes)
    # print(translate)
    return ''.join(translate[i] for i in text), translate

def prefix_free_decompression(compressed, translation_dict):
    """
    Decompresses a prefix free `compressed` message in the form of a string
    composed only of '0' and '1'.

    Being the binary codes prefix free,
    the decompression is allowed to take the earliest match it finds.

    >>> message, d = prefix_free_compression("aaaaa bbbb ccc dd e", ["01", "11"])
    >>> message
    '010101010100111111111001101101101001000100010011001'
    >>> sorted(d.items())
    [(' ', '001'), ('a', '01'), ('b', '11'), ('c', '101'), ('d', '0001'), ('e', '1001')]
    >>> ''.join(prefix_free_decompression(message, d))
    'aaaaa bbbb ccc dd e'
    """
    decoding_translate = reverse_dict(translation_dict)
    # print(decoding_translate)
    word = ''
    for bit in compressed:
        # print(word, "-", bit)
        if word in decoding_translate:
            yield decoding_translate[word]
            word = ''
        word += bit
    yield decoding_translate[word]


if __name__ == "__main__":
    doctest.testmod()
    with open("commedia.txt") as f:
        text = f.read()
    compressed, d = prefix_free_compression(text)
    with open("commedia.pfc", "w") as f:
        t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))
        print(len(t))
        f.write(t)
    with open("commedia.pfcd", "w") as f:
        f.write(json.dumps(d))
    # dividing by 8 goes from bit length to byte length
    print("Compressed / uncompressed ratio is {}".format((len(compressed)//8) / len(text)))
    original = ''.join(prefix_free_decompression(compressed, d))
    assert original == text

commedia.txt is at filedropper.com/commedia

【Comments】:

  • Yes, could you provide the input text and the code?
  • @Tiger-222 The input file for the algorithm is: filedropper.com/commedia
  • @Tiger-222 I have added the full code, just in case.
  • Note that print(len(''.join(chr(i) for i in range(256)).encode('utf8'))) prints 384. If you drop the 'utf8' argument, the .encode method uses your system's default encoding.

【Tags】: python file python-3.x byte wc


【Solution 1】:

You are using Python 3 and str objects, which means the count you see from len(t) is the number of characters in the string. Characters are not bytes, and it has been that way since the 90's.

Since you did not declare an explicit text encoding, the file write encodes your text using the system default encoding, which on Linux or Mac OS X is utf-8: an encoding in which any character outside the ASCII range (ord(ch) > 127) takes more than one byte on disk.
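A minimal demonstration of this effect (a generic example, not the author's data): a 256-character string holding every code point from 0 to 255 occupies more than 256 bytes once UTF-8-encoded, while latin1 keeps one byte per code point.

```python
# Characters vs. bytes: len() on a str counts characters, but the number
# of bytes written to disk depends on the encoding used.
s = ''.join(chr(i) for i in range(256))
print(len(s))                   # 256 characters
print(len(s.encode('utf-8')))   # 384 bytes: code points 128-255 take 2 bytes each
print(len(s.encode('latin1')))  # 256 bytes: one byte per code point
```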

So, your program is basically wrong. First, decide whether you are dealing with text or bytes. If you are working with bytes, open the file for writing in binary mode (wb, not w) and change this line:

t = ''.join(chr(int(b, base=2)) for b in chunks(compressed, 8))

to:

t = bytes(int(b, base=2) for b in chunks(compressed, 8))

That way it is clear that you are working with the bytes themselves, instead of mangling characters and bytes.
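A runnable sketch of the suggested fix, using a made-up bit string in place of the question's `compressed` (and a hypothetical temp-file path): build a bytes object from the 8-bit chunks and write it in binary mode, so one element equals exactly one byte on disk.

```python
import os
import tempfile

bits = '1111111100000001'                        # two 8-bit chunks: 255 and 1
chunks8 = (bits[i:i + 8] for i in range(0, len(bits), 8))
t = bytes(int(b, base=2) for b in chunks8)       # b'\xff\x01'

path = os.path.join(tempfile.gettempdir(), 'pfc_demo.bin')  # hypothetical path
with open(path, 'wb') as f:                      # 'wb', not 'w': no encoding step
    f.write(t)
assert os.path.getsize(path) == len(t)           # on-disk size matches len(t)
```

With binary mode there is no implicit encoding, so len(t) and the file size (what wc reports) can no longer disagree.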

Of course, there is an ugly workaround to "transparently encode" your text to a bytes object (valid only if your original string contains character code points exclusively in the 0-255 range, that is): you can encode your previous t with the latin1 encoding before writing it to the file. But that is semantically wrong.
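The latin1 workaround can be sketched in a couple of lines: latin1 maps each code point in 0-255 to exactly one byte, so the character count and the byte count coincide and the round-trip is lossless.

```python
# latin1 as a "transparent" str <-> bytes bridge for code points 0-255.
t = ''.join(chr(i) for i in range(256))
data = t.encode('latin1')
assert len(data) == len(t) == 256   # one byte per character
assert data.decode('latin1') == t   # lossless round-trip
```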

You can also try Python's lesser-known "bytearray" object: it gives you the ability to deal with elements as 8-bit numbers, with the convenience of being mutable and extendable (just like a C "string" with enough memory space pre-allocated).
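A quick look at bytearray, as a small illustrative sketch: it is a mutable sequence of ints in the 0-255 range, so it could accumulate the decoded 8-bit chunks one at a time.

```python
# bytearray: mutable, growable, elements are ints in range(256).
buf = bytearray()
for b in ('11111111', '00000001'):
    buf.append(int(b, base=2))   # grow one byte at a time
buf[0] = 0                       # mutable in place, unlike bytes
print(bytes(buf))                # prints b'\x00\x01'
```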

【Comments】:

  • Replacing the line as you suggested fixed it: wc now reports the same correct byte count of 318885 as Python. In short, using characters instead of bytes wasted almost 150000 bytes because of the "inefficient" (modern) encoding. Good to know. (Thanks for the thorough answer :))
  • No - using characters instead of bytes recorded the wrong result, not what you wanted. There is nothing inefficient about utf-8: if you restrict yourself to unaccented Latin characters, the text is the same size as in ASCII, yet the same file can record text in most known human languages and scripts, using more than one byte per character. What you are recording are numbers between 0 and 255, and they were being recorded in your file as arbitrary characters.
【Solution 2】:

@jsbueno is right. Moreover, if you open the generated file in binary read mode, you get the count you saw from wc:

>>> with open('commedia.pfc', 'rb') as f:
...     print(len(f.read()))
461491

【Comments】:
