How to read inputs from stdin and enforce an encoding? (Answers)

【Question title】: How to read inputs from stdin and enforce an encoding?
【Posted】: 2018-05-05 15:50:52
【Question description】:

The goal is to read stdin continuously and enforce UTF-8 in both Python 2 and Python 3.

I tried the following:

#!/usr/bin/env python

from __future__ import print_function, unicode_literals
import io
import sys

# Supports Python2 read from stdin and Python3 read from stdin.buffer
# https://stackoverflow.com/a/23932488/610569
user_input = getattr(sys.stdin, 'buffer', sys.stdin)


# Enforcing utf-8 in Python3
# https://stackoverflow.com/a/16549381/610569
with io.TextIOWrapper(user_input, encoding='utf-8') as fin:
    for line in fin:
        # Reads the input line by line
        # and do something, for e.g. just print line.
        print(line)

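The code works in Python 3, but in Python 2 io.TextIOWrapper cannot wrap sys.stdin, because the underlying file object has no readable() method, and it throws: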

Traceback (most recent call last):
  File "testfin.py", line 12, in <module>
    with io.TextIOWrapper(user_input, encoding='utf-8') as fin:
AttributeError: 'file' object has no attribute 'readable'

This is because in Python 3, user_input (that is, sys.stdin.buffer) is an _io.BufferedReader object, and its attributes include readable:

<class '_io.BufferedReader'>

['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_dealloc_warn', '_finalizing', 'close', 'closed', 'detach', 'fileno', 'flush', 'isatty', 'mode', 'name', 'peek', 'raw', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']

In Python 2, however, user_input is a file object, and its attributes do not include readable:

<type 'file'>

['__class__', '__delattr__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'closed', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']
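The difference between the two attribute lists can be probed with a simple feature check. The sketch below is only an illustration: LegacyFile and has_readable are hypothetical names, and LegacyFile merely imitates a Python 2 file object by offering read() without readable().

```python
import io


class LegacyFile(object):
    """Hypothetical stand-in for a Python 2 file object: read() but no readable()."""

    def read(self, size=-1):
        return b""


def has_readable(stream):
    # io.TextIOWrapper calls readable() on the stream it wraps,
    # so a stream without that method cannot be wrapped directly.
    return callable(getattr(stream, "readable", None))


modern = has_readable(io.BytesIO(b"data"))  # io-style binary stream
legacy = has_readable(LegacyFile())         # Python 2 style file object
print(modern, legacy)
```

A stream that fails this check needs a different wrapper than io.TextIOWrapper.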

【Question comments】:

Tags: python file utf-8 io stdin

【Solution 1】:

If you don't need a full io.TextIOWrapper, but only a decoding stream for reading, you can use codecs.getreader() to create a decoding wrapper:

import codecs

# user_input is the binary stdin stream defined in the question
reader = codecs.getreader('utf8')(user_input)
for line in reader:
    # do whatever you need...
    print(line)

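codecs.getreader('utf8') creates a factory for a codecs.StreamReader, which is then instantiated with the original stream. I'm not sure whether StreamReader supports the with context, but that is probably not strictly necessary (there is no need to close STDIN after reading, I guess...).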
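To see the factory step in isolation, here is a minimal sketch that swaps stdin for an in-memory byte stream. The io.BytesIO object and the sample text are assumptions made for illustration; in the question the stream would be user_input.

```python
import codecs
import io

# Hypothetical stand-in for the binary stdin stream (user_input in the question).
raw = io.BytesIO(u"h\u00e9llo\nw\u00f6rld\n".encode("utf-8"))

# getreader() returns the StreamReader class for the codec;
# calling that class with the byte stream creates the decoding wrapper.
reader_factory = codecs.getreader("utf-8")
reader = reader_factory(raw)

# Iterating yields decoded text lines, line endings included.
lines = [line for line in reader]
print(lines)
```

Because the wrapper only needs read() on the underlying stream, this approach also works with streams that offer a very limited interface.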

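I have used this solution successfully in cases where the underlying stream offers only a very limited interface.

Update (second version)

From the comments it became clear that you actually need an io.TextIOWrapper to get proper line buffering and so on in interactive mode; codecs.StreamReader only works for piped input and the like.

Using this answer, I was able to get interactive input working: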

#!/usr/bin/env python
# coding: utf8

from __future__ import print_function, unicode_literals
import io
import sys

user_input = getattr(sys.stdin, 'buffer', sys.stdin)

with io.open(user_input.fileno(), encoding='utf8') as f:
    for line in f:
        # do whatever you need...
        print(line)

This creates an io.TextIOWrapper with the enforced encoding from the binary STDIN buffer.
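The same call can be tried without touching the real STDIN. The sketch below assumes an os.pipe() pair as a stand-in for the stdin file descriptor; the sample text is made up for illustration.

```python
import io
import os

# Hypothetical stand-in for stdin: a pipe whose read end plays the role of fd 0.
read_fd, write_fd = os.pipe()
os.write(write_fd, u"caf\u00e9\nna\u00efve\n".encode("utf-8"))
os.close(write_fd)  # signal end of input

# io.open() accepts an integer file descriptor and returns a text wrapper
# that decodes with the requested encoding.
with io.open(read_fd, encoding="utf-8") as f:
    is_text_wrapper = isinstance(f, io.TextIOWrapper)
    lines = [line for line in f]

print(is_text_wrapper, lines)
```

Note that leaving the with block closes the descriptor. If the descriptor has to stay open afterwards, io.open() also accepts closefd=False.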

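【Comments】:

Not quite right if the buffer is needed for streaming. If you try the code snippet in the OP with Python 3, you'll see different behavior. sys.stdin behaves differently from plain input() or raw_input(). In my scenario, stdin is needed to keep the stream going, e.g. if there is a socket and the stream should not be closed.
For context, this code will be used in github.com/marian-nmt/marian-dev/blob/master/scripts/server/…, where a socket is opened for user input from stdin. While it is possible to write a while loop that uses input(), it is a little odd to do so when stdin already does that by itself. The problem is that when UTF-8 strings are passed in, they need to be handled, hence io.TextIOWrapper =)
I'm not sure I understand your comments. I wasn't considering the built-in [raw_]input(); I just reused your user_input variable, which is defined in the OP as getattr(sys.stdin, 'buffer', sys.stdin). Unless there is a bug, the suggested solution should work with streams (it doesn't close STDIN or anything).
@alvas I see it now. Yes, with codecs.StreamReader you need to repeat the ctrl+D signal to trigger a flush. It took me three of them to end the script…
The lesson I learned from this is that in most cases you should not try to instantiate io classes directly, but use io.open() instead.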

【Solution 2】:

Have you tried forcing UTF-8 encoding in Python as follows:

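# Python 2 only: reload() is not a builtin and sys.setdefaultencoding()
# does not exist in Python 3.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')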

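【Comments】:

The point is to avoid setting the environment, so that the script supports both Python 2 and Python 3. Also, reloading sys to change the default encoding is discouraged =(
This affects implicit conversion between Unicode and ASCII, globally. It is a terrible idea, because libraries are built on the expectation that non-ASCII data raises an exception. This change breaks that expectation.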
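
To make that last comment concrete, the sketch below shows the failure that libraries count on. It performs the ASCII decode explicitly, as an illustration of what Python 2 does implicitly when byte strings and Unicode strings are mixed; the sample data is an assumption.

```python
# UTF-8 encoded bytes containing a non-ASCII character.
data = u"caf\u00e9".encode("utf-8")

# Decoding as ASCII fails loudly, which is the behavior libraries rely on.
try:
    data.decode("ascii")
    raised = False
except UnicodeDecodeError:
    raised = True

# Decoding explicitly with the intended encoding is the safe alternative
# to changing the process-wide default.
decoded = data.decode("utf-8")
print(raised, decoded == u"caf\u00e9")
```

Changing the default encoding would turn the failing branch into a silent success for the whole process, including code that was never written with that in mind.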