Why does codecs.iterdecode() eat empty strings?

【Question title】: Why does codecs.iterdecode() eat empty strings?
【Posted】: 2017-10-09 19:24:12
【Question】:

Why do the following two ways of decoding return different results?

>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']

Is this a bug or the expected behavior? My Python version is 2.7.13.

【Comments】:

It seems to check whether the decoder managed to return a value, but that check also discards empty strings: hg.python.org/cpython/file/tip/Lib/codecs.py#l1040

Tags: python python-2.7 unicode utf-8 codec

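【Answer 1】:

This is expected. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it does not guarantee a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.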
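
The concatenation guarantee can be checked directly. This is only a minimal sketch and assumes Python 3, where the input chunks must be bytes objects (in the question's Python 2.7, plain str literals play that role):

```python
import codecs

# Same shape as the question's data, written as Python 3 bytes chunks.
chunks = [b"", b"", b"a", b""]

# iterdecode skips chunks that decode to nothing, so only one item comes out.
decoded = list(codecs.iterdecode(chunks, "utf-8"))
print(decoded)  # ['a']

# The documented-in-practice guarantee: joined output == decoding of joined input.
assert "".join(decoded) == b"".join(chunks).decode("utf-8")
```

The number of output items differs from the number of input items, yet the joined text is the same.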

If you look at the source code, you can see that it explicitly discards empty output chunks:

def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
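
If one output item per input chunk is really needed, a possible workaround is to drive the incremental decoder yourself and drop the `if output:` filter inside the loop. The helper name `iterdecode_one_to_one` below is hypothetical (it is not part of the codecs module), and the sketch assumes Python 3 bytes input:

```python
import codecs


def iterdecode_one_to_one(chunks, encoding, errors="strict"):
    """Hypothetical helper: yield exactly one decoded string per input chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in chunks:
        # No emptiness check here, so empty results are yielded too.
        yield decoder.decode(chunk)
    # Flush the decoder; with errors='strict' this raises if bytes are left over.
    tail = decoder.decode(b"", True)
    if tail:
        yield tail


result = list(iterdecode_one_to_one([b"", b"", b"a", b""], "utf-8"))
print(result)  # ['', '', 'a', '']
```

Note that an output item can still be empty for a non-empty input chunk when the decoder is buffering an incomplete character, so the items line up by position, not by content.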

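Note that the reason iterdecode exists, and the reason you would not simply call decode on every chunk yourself, is that decoding is stateful. The UTF-8 encoded form of a single character may be split across several chunks. Other codecs can have much stranger stateful behavior, for example a byte sequence that inverts the case of every following character until the same byte sequence appears again.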
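
The stateful case can be illustrated with a character whose UTF-8 bytes straddle a chunk boundary. This is a small sketch assuming Python 3; the chunk split is chosen by hand for illustration:

```python
import codecs

euro = "€".encode("utf-8")            # three bytes: b'\xe2\x82\xac'
chunks = [euro[:1], euro[1:], b"!"]   # the character is split across two chunks

# Decoding each chunk independently fails on the incomplete first chunk.
try:
    per_chunk = [c.decode("utf-8") for c in chunks]
except UnicodeDecodeError:
    per_chunk = None

# iterdecode buffers the partial character and emits it once it is complete.
streamed = list(codecs.iterdecode(chunks, "utf-8"))
print(per_chunk)  # None
print(streamed)   # ['€', '!']
```

The first chunk produces no output on its own, which is exactly the kind of empty result that iterdecode filters out.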

【Comments】: