Python: convert a binary file into a string while ignoring non-ASCII characters

【Question title】: Python convert binary file into string while ignoring non-ascii characters
【Posted】: 2015-05-08 13:06:20
【Question】:

I have a binary file and I want to extract all the ASCII characters while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

However, I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I make Python ignore the non-ASCII characters?

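【Comments】:

Are you sure there are no Unicode characters in the file?
Your input file appears to be encoded as utf-16-le, so you should specify that encoding when you open the file. In Python 2 you need to use codecs.open, but in Python 3 you can use the ordinary built-in open.

Tags: python non-ascii-characters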
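
As a hedged illustration of the second comment, assuming Python 3: the built-in open accepts encoding and errors arguments, so both the decoding and the dropping of non-ASCII characters can be left to the file objects. The function name copy_ascii and its parameters are hypothetical, not something from the question.

```python
def copy_ascii(src_path, dst_path, src_encoding="utf-16-le"):
    """Copy src to dst, keeping only ASCII characters (hypothetical helper)."""
    # Text mode with an explicit encoding decodes the bytes while reading.
    # newline="" turns off newline translation so the text passes through as-is.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # errors="ignore" makes the ASCII encoder drop whatever it cannot represent.
    with open(dst_path, "w", encoding="ascii", errors="ignore", newline="") as dst:
        dst.write(text)
```

If the input is not valid UTF-16-LE, the read step will still raise a UnicodeDecodeError; this sketch only handles characters that fail to encode on output.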

【Solution 1】:

Use the built-in ASCII codec and tell it to ignore any errors, for example:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()

You can test and play with this in the Python interpreter (the session below is Python 2):

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Simply trying to convert it to a string raises an exception.

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does trying to encode that unicode string as ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it cannot handle works fine:

>>> s.encode('ascii', 'ignore')
'hello  there'
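
The interpreter session above is Python 2. A minimal sketch of the same idea for Python 3, where encode returns bytes rather than a string, might look like the following. The helper name ascii_only and the sample data are made up for illustration.

```python
def ascii_only(raw, encoding="utf-16-le"):
    """Decode raw bytes and drop every non-ASCII character (hypothetical helper)."""
    text = raw.decode(encoding)
    # encode(..., "ignore") silently drops characters ASCII cannot represent;
    # decoding the result gives back a Python 3 str.
    return text.encode("ascii", "ignore").decode("ascii")


# Sample data for illustration only: 'hello \u00a0 there' stored as UTF-16-LE
sample = "hello \u00a0 there".encode("utf-16-le")
print(ascii_only(sample))
```

The returned value is an ordinary str, so it can be written to a file opened in text mode without the UnicodeEncodeError from the question.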

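【Comments】:

Is there a predetermined range for what Python considers ASCII? The output is still picking up characters such as SOH and ACK (not sure what these are; I am just typing them as they appear in Sublime Text).
@VeraWang SOH and ACK are ASCII. The range is 0 to 127, and those two are 1 and 6 respectively.
@VeraWang -- ASCII characters 0..31 are non-printable (including those two; see the chart on this Wikipedia page about ASCII - en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart). Perhaps more information about the actual problem you are trying to solve would be useful if this does not give you what you need...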

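【Solution 2】:

Basically, the ASCII table takes values in the range [0, 2⁷) and associates them with (printable or non-printable) characters. So, to ignore non-ASCII characters, you just have to drop the characters whose code is not in [0, 2⁷), that is, keep only those whose code is less than or equal to 127.

In Python there is a function called ord which, according to its docstring:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. Now you must ignore every character for which ord returns a value greater than 127. This can be done as follows: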

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# A with block closes the output file even if a write fails
with open("text.txt", "w") as out_file:
    # Check every single character of `text`
    for character in text:
        # Keep it only if it is an ASCII character (code 0..127)
        if ord(character) < 128:
            out_file.write(character)

Now, if you only want to keep the printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so you simply have to do:

if 32 <= ord(character) <= 126:
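
To show the printable-range test in context, here is a small sketch, assuming Python 3. The helper name keep_printable_ascii and the sample string are hypothetical.

```python
def keep_printable_ascii(text):
    """Return text without anything outside ASCII codes 32..126 (hypothetical helper)."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


# Sample string for illustration: SOH (1), NO-BREAK SPACE (160) and DEL (127)
# all fall outside 32..126 and are dropped.
sample = "a\x01b\u00a0c\x7f~"
print(keep_printable_ascii(sample))
```

Note that this range also removes tabs (9) and newlines (10), since both are below 32; if line structure matters, those codes would have to be allowed explicitly.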

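【Comments】:

So if I only want the ASCII printable characters [32, 127], is it simply ord(char) < 128 and ord(char) > 31?
@VeraWang Almost (127 is not printable), though 31 < ord(char) < 127 is simpler.
@VeraWang Almost! You forgot that 127 is the DELETE character, which is not printable, so the interval is now the closed [32, 126]: ord(character) <= 126 and ord(character) >= 32
Or change it to 32 <= ord(character) <= 126, since that is obviously what she wants. That should be enough of a change.
You keep using if ord(character) >= 32 and ord(character) <= 126... why?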