UnicodeEncodeError AND TypeError: can only concatenate str (not "bytes") to str

【Question title】: UnicodeEncodeError AND TypeError: can only concatenate str (not "bytes") to str
【Posted】: 2019-04-07 10:37:36
【Question】:

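I have a problem: I am trying to search with the Google Custom Search API for Python, but when I search for text stored in a variable instead of typing it in by hand, I get UnicodeEncodeError: 'ascii' codec can't encode character '\xa2' in position 104: ordinal not in range(128). When I work around it with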

    .encode('ascii', 'ignore').decode('ascii')

it shows another error, this time from the Google Custom Search call:

    TypeError: can only concatenate str (not "bytes") to str.

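P.S. I have also tried things like str() or .decode.

Edit: To be clear, the input stored in the variable comes from Pytesseract, which reads the text of an image. I store that text in a variable and then try to search for it with the Google Custom Search API. When the Unicode error appeared, I looked for solutions on Stack Overflow and found that I could try calling .decode on the variable to make the problem go away. That did fix the first problem, but now a new one appears: TypeError: can only concatenate str (not "bytes") to str. So I can't use .decode, because it produces another error. What can I do?

Edit 2.0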

text_photo = pytesseract.image_to_string(img2)  # read the text from the image into a variable
text_photo = text_photo.replace('\r', '').replace('\n', '')  # strip the \r and \n line breaks

rawData = urllib.request.urlopen(url_google_1 + text_photo1 + '+' + text_photo2 + url_google_2).read()

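url_google_1 contains the first part of the link for the Google search (API key...), and url_google_2 contains what I want to get back from Google. In the middle I add the variables, because they hold what I want to search for. If I type the query by hand (for example hello), it works; the problem is that the text Tesseract produces is in a format that is not compatible. I tried str(text_photo) and .decode, but neither works. The response is then parsed with:

    json_data = json.loads(rawData)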

【Question comments】:

Tags: python unicode python-unicode google-custom-search

【Solution 1】:

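I can't follow every detail of your specific problem, but I'm fairly sure the root cause is this:

Python 3 distinguishes two string types, str and bytes, which are similar but incompatible.

Once you understand what that means, what each of them can and can't do, and how to get from one to the other, I'm sure you can work out how to build the URL for the API call correctly.

Different types, not compatible (the exact wording of the TypeError varies between Python 3 versions):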

>>> type('abc'), type(b'abc')
(<class 'str'>, <class 'bytes'>)

>>> 'abc' + b'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes

>>> b'abc' + 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes

If you want to combine them, you need to convert everything to the same type. To convert, encode str into bytes and decode bytes into str:

>>> 'abc'.encode()
b'abc'
>>> b'abc'.decode()
'abc'
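
When it is not obvious which of the two types a value has, a small helper can normalise it before concatenation. The following is only a minimal sketch: `to_text` is a hypothetical name, and UTF-8 is an assumed default for any bytes it receives.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return str(value)

# Both pieces are str after normalisation, so concatenation works.
combined = to_text("abc") + to_text(b"def")
print(combined)  # abcdef
```

If the bytes were not produced with the assumed encoding, `to_text` raises a UnicodeDecodeError rather than guessing.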

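The str.encode and bytes.decode methods take an optional encoding= parameter, which defaults to UTF-8. This parameter defines the mapping between the characters in a str and the octets in a bytes object. If mapping characters to bytes with the given encoding fails, you get a UnicodeEncodeError. This happens when you use a character that is not defined in the given mapping: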

>>> '5 £'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 2: ordinal not in range(128)

Similarly, if some text was encoded with encoding X but you try to decode it with encoding Y, you may see a UnicodeDecodeError:

>>> b = '5 £'.encode('utf8')
>>> b
b'5 \xc2\xa3'
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

You can avoid the exception with the errors="ignore" strategy, but you lose information that way:

>>> '5 £'.encode('ascii', errors='ignore')
b'5 '
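
For comparison, here is a short sketch of how a few of the standard error handlers treat the same unencodable character. None of them recovers the original text in ASCII; they only differ in what is left behind.

```python
text = "5 £"

# Each handler deals with the unencodable '£' differently.
ignored = text.encode("ascii", errors="ignore")            # drops the character
replaced = text.encode("ascii", errors="replace")          # substitutes '?'
escaped = text.encode("ascii", errors="backslashreplace")  # keeps a Python-style escape

print(ignored)   # b'5 '
print(replaced)  # b'5 ?'
print(escaped)   # b'5 \\xa3'
```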

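In general, if you work with text, use str everywhere. You also shouldn't need to call .encode/.decode directly very often; file handles and the like usually accept str and convert to bytes behind the scenes.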
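
As an illustration of that last point, the sketch below writes and reads a str through a text-mode file and then reopens the file in binary mode to show the bytes that were stored. The temporary file name is arbitrary, and UTF-8 is chosen explicitly rather than relying on the platform default.

```python
import os
import tempfile

text = "5 £"

# Text mode: the file object encodes on write and decodes on read.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)            # accepts str
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()    # returns str

# Binary mode exposes the bytes that were actually written.
with open(path, "rb") as f:
    raw = f.read()

print(round_trip == text)  # True
print(raw)                 # b'5 \xc2\xa3'
```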

In your case, you need to find out where and why str and bytes are being mixed, and then make sure everything has the same type before concatenating.
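
In the question, the mixing most likely happens while the URL is being built: calling .encode(...) on the OCR text yields bytes, which then cannot be concatenated with the str parts of the URL. One possible approach is to keep the OCR text as str and let urllib.parse.urlencode percent-encode it, which should also give an ASCII-only URL. This is a sketch under assumptions: the endpoint and the key, cx and q parameter names are my assumption about the Custom Search JSON API and are not taken from the question, and `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; check it against the API documentation you are using.
BASE_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, engine_id, *terms):
    """Build an ASCII-only request URL from str search terms."""
    query = " ".join(term.strip() for term in terms)
    params = {"key": api_key, "cx": engine_id, "q": query}
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return BASE_URL + "?" + urlencode(params)

# Placeholder credentials and search terms, for illustration only.
url = build_search_url("MY_KEY", "MY_ENGINE", "50 ¢", "coin")
print(url)
# https://www.googleapis.com/customsearch/v1?key=MY_KEY&cx=MY_ENGINE&q=50+%C2%A2+coin
```

The result is a plain str, so it can be passed to urllib.request.urlopen the same way as in the question, without any manual .encode or .decode calls.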

【Comments】: