Unicode in Python - parsing JSON: answers

【Question title】: Unicode in Python - parsing JSON
【Posted】: 2015-01-25 05:38:03
【Question】:

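I wrote this little piece of code to take a JSON file and import its contents into the consul key-value store. I'm pleased that the recursion works exactly as I expected, but it fails when the source .json file contains non-ASCII characters: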

#!/usr/bin/python

import sys
import json

filename = str(sys.argv[1])
fh = open(filename)

def printDict (d, path):
  for key in d:
    if isinstance(d[key], dict):
      printDict(d[key], path + str(key) + "/")
    else:
      print 'curl -X PUT http://localhost:8500/v1/kv/' + filename + path + key + ' -d "' + str(d[key]) + '"'
  return

j = json.load(fh)
printDict(j, "/")

Example of a failing JSON file on disk:

{
    "FacetConfig" : {
        "facet:price-lf-p" : {
             "prefixParts" : "£"
        }
    }
}

When I run the code as-is, I get a nasty exception, because that nice simple str() cannot convert the British pound sign to 7-bit ASCII:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

How can I solve this without making too much of a mess of what started out as small, elegant code? :)

【Comments】:

Tags: python json unicode

【Solution 1】:

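Rather than use str(), encode the unicode values explicitly. Since you are using your values as URL elements, you have to encode the key to UTF-8 and then URL-quote it; the value only needs to be encoded to UTF-8.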

import urllib

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + path +
       urllib.quote(key.encode('utf8')) + ' -d "' + 
       unicode(d[key]).encode('utf8') + '"')

You could use string formatting here to make this more readable:

print 'curl -X PUT http://localhost:8500/v1/kv/{}{}{} -d "{}"'.format(
    filename, path, urllib.quote(key.encode('utf8')), 
    unicode(d[key]).encode('utf8'))

If d[key] is always a string value, the unicode() call is redundant, but if you also have numeric, boolean or None values, it makes sure the code keeps working.

The server may expect a Content-Type header; if you do send one, consider adding a charset=utf8 parameter to that header. It looks like Consul treats the data as opaque, however.
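
For readers on Python 3, the same idea can be sketched as follows. This is only an illustration under assumptions, not part of the original answer: the helper name kv_put_command and its parameters are hypothetical, and urllib.parse.quote stands in for Python 2's urllib.quote (it encodes text as UTF-8 by default before percent-escaping).

```python
from urllib.parse import quote


def kv_put_command(base_url, key_path, value):
    """Build a curl command string for one key/value pair (hypothetical helper)."""
    # quote() encodes the text as UTF-8 and percent-escapes it; "/" is kept as-is.
    url = base_url + quote(key_path)
    # str() plays the role of Python 2's unicode(): it also handles int, bool and None.
    return 'curl -X PUT {} -d "{}"'.format(url, str(value))
```

Only the URL part is percent-escaped here; the value is passed through as text, matching the answer's point that the value just needs to end up as UTF-8 when it is written out.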

【Comments】:

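Hmm, it's d[key] that holds the magic £ symbol, but now I get a worse error: AttributeError: 'module' object has no attribute 'urlquote'
@gdhgdh: Right, that was a typo on my part. You still want to encode your URL properly.
It took me a moment to actually read the error and notice that it came from another part of the code, so I deleted my follow-up question. Thanks for the reply all the same! :)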

【Solution 2】:

Just remove the str from str(d[key]). That is,

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + 
       path + key + ' -d "' + str(d[key]) + '"')

becomes:

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + 
       path + key + ' -d "' + d[key] + '"')

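The problem here is that the str type in Python 2 is a byte string, and converting to it implicitly uses the ASCII codec, so it is effectively limited to ASCII characters. type(d[key]) is unicode, so you can't convert it to str... but that's fine, we can print it anyway.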
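
The failure can be reproduced with an explicit encode. The snippet below is a small sketch written for Python 3 (where the conversion is never implicit); the helper name encode_or_none is made up for illustration.

```python
def encode_or_none(text, codec):
    """Return text encoded with codec, or None if the codec cannot represent it."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


pound = "\u00a3"  # the pound sign from the question's JSON file

print(repr(encode_or_none(pound, "ascii")))  # None
print(repr(encode_or_none(pound, "utf-8")))  # b'\xc2\xa3'
```

The ASCII codec has no byte for U+00A3, which is the same limitation that str() runs into in Python 2, while UTF-8 represents it as the two bytes C2 A3.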

【Comments】:

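Fair enough, I provided an incomplete source; the real file has hundreds of lines with booleans and integers, for which str() gave a nice representation as well. So simply removing str() does not help (I had already tried ;)
@gdhgdh: Then use unicode() instead of str() and encode.
Simply using unicode() solved it - I didn't even realise that was a thing. Brilliant, thanks. Now how do I upvote an answer that isn't actually an answer? :)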

【Solution 3】:

"How can I solve this without making too much of a mess of what started out as small, elegant code?"

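Unfortunately, a few extra steps are needed to prevent decoding/encoding errors. Python 2.x has many places where it does implicit encoding/decoding, i.e. behind your back and without your permission. When Python does an implicit encode/decode, it uses the ascii codec, which results in an encoding/decoding error if any utf-8 (or other non-ascii) characters are present. Therefore, you have to find all the places where Python does implicit encoding/decoding and replace them with explicit encoding/decoding, if you want your program to handle non-ascii characters in those places.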

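At a minimum, any input from an external source should be decoded to a unicode string before you proceed, which means you have to know the input's encoding. But if you combine unicode strings with regular strings, you can get encoding/decoding errors, for example: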

#-*- coding: utf-8 -*-   #Allows utf-8 characters in your source code
unicode_str = '€'.decode('utf-8')
my_str = '{0}{1}'.format('This is the Euro sign: ', unicode_str) 

--output:--
Traceback (most recent call last):
  File "1.py", line 3, in <module>
    my_str = '{0}{1}'.format('This is the Euro sign: ', unicode_str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

So all of your strings should probably be decoded to unicode strings. Then, when you want to output the strings, you need to encode the unicode strings.

import sys
import json
import codecs
import urllib

def printDict(d, path, filename):
    for key, val in d.items():  #key is a unicode string, val is a unicode string or dict
        if isinstance(val, dict): 
            printDict(
                val,
                u'{0}{1}/'.format(path, key),  #format() specifiers require 0,1 for python 2.6
                filename
            )
        else:
            key_str = key.encode('utf-8')
            val_str = val.encode('utf-8')

            url = '{0}{1}{2} -d "{3}"'.format(
                filename, 
                path, 
                key_str, 
                val_str
            )
            print url
            url_escaped = urllib.quote(url)
            print url_escaped

            curl_cmd = 'curl -X PUT'            
            base_url = 'http://localhost:8500/v1/kv/'
            print "{0} {1}{2}".format(curl_cmd, base_url, url_escaped)


filename = sys.argv[1].decode('utf-8')
file_encoding = 'utf-8'
fh = codecs.open(filename, encoding=file_encoding)
my_json = json.load(fh)
fh.close()

print my_json

path = "/"
printDict(my_json, path.decode('utf-8'), filename)  #Can the path have  non-ascii characters in it?

--output:--
{u'FacetConfig': {u'facet:price-lf-p': {u'prefixParts': u'\xa3'}}}
data.txt/FacetConfig/facet:price-lf-p/prefixParts -d "£"
data.txt/FacetConfig/facet%3Aprice-lf-p/prefixParts%20-d%20%22%C2%A3%22
curl -X PUT http://localhost:8500/v1/kv/data.txt/FacetConfig/facet%3Aprice-lf-p/prefixParts%20-d%20%22%C2%A3%22
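
As a point of comparison, here is a rough Python 3 sketch of the same recursive walk, where every string is already unicode and no explicit decode step is needed. It is an assumption-laden illustration rather than a port of the answer: the generator name curl_commands is hypothetical, and this variant percent-escapes only the URL path and leaves the -d part unescaped.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8500/v1/kv/"  # endpoint taken from the question


def curl_commands(d, filename, path="/"):
    """Yield one curl command string per leaf value of the nested dict d."""
    for key, val in d.items():
        if isinstance(val, dict):
            # Recurse into nested objects, extending the key path.
            yield from curl_commands(val, filename, path + key + "/")
        else:
            # quote() encodes to UTF-8 and percent-escapes; "/" is left alone.
            url = BASE_URL + quote(filename + path + key)
            # format() converts non-string leaves (int, bool, None) via str().
            yield 'curl -X PUT {0} -d "{1}"'.format(url, val)
```

Calling list(curl_commands(json.load(fh), filename)) on a parsed file would give the commands as a list; a generator keeps the function free of print calls so the caller decides how to encode the output.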

【Comments】:

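Don't use codecs.open(); json.load() expects to load bytes in Python 2.
"Encodings that are not ASCII based (such as UCS-2) are not allowed, and should be wrapped with codecs.getreader(encoding)(fp), or simply decoded to a unicode object and passed to loads()." docs.python.org/2/library/json.html#module-json
Sorry, I misread. Personally, I would avoid codecs.getreader(), because the codecs I/O implementation has a lot of issues. Use io.open() instead. But evidently the OP's problem is not with the input file.
Obviously, input and output are related, and they have to work in sync with each other. So you should establish a baseline by decoding the input to unicode strings, for example by specifying the file's encoding when you open it, and then for output you encode() the unicode strings. See my answer for an example of how to do that.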
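
The io.open() suggestion from the comments could look roughly like this. It is a minimal sketch, assuming the file is UTF-8 encoded; the function name load_json is made up for illustration. io.open() exists in Python 2.6+ and is the same function as the built-in open() in Python 3.

```python
import io
import json


def load_json(path, encoding="utf-8"):
    """Open path as text in the given encoding and parse it as JSON."""
    # io.open() decodes the bytes on read, so json.load() receives unicode text.
    with io.open(path, encoding=encoding) as fh:
        return json.load(fh)
```

The with block also closes the file handle, which replaces the explicit fh.close() call.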