【Question title】: Writing pandas DataFrame to JSON in unicode
【Posted】: 2017-01-29 10:47:12
【Description】:

I'm trying to write a pandas DataFrame containing unicode to JSON, but the built-in .to_json function escapes the characters. How do I fix this?

Example:

import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json')

This gives:

{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

which differs from the desired result:

{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}


I have tried adding the force_ascii=False argument:
import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json', force_ascii=False)

But this raises the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 11: character maps to <undefined>


I'm using WinPython 3.4.4.2 64bit with pandas 0.18.0
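
For context: the 'charmap' codec named in this message is the kind Python uses for single-byte code pages, which suggests the output file was opened with such a code page rather than UTF-8. Below is a minimal sketch of the same failure, assuming cp1252 as the code page (the actual default encoding on the machine above is not stated, so this is only an illustration):

```python
# A short JSON fragment containing the Greek letter tau
text = '{"0":{"0":"τ"}}'

error = None
try:
    # cp1252 is an assumed stand-in for the platform's default code page
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# UTF-8 can represent every Unicode character, so this succeeds
encoded = text.encode("utf-8")
print(encoded)
```

If this is indeed the cause, choosing the file encoding explicitly should avoid the error.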

【Comments】:

    Tags: python json pandas unicode


    【Solution 1】:

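    There is another way to do this. Since JSON consists of keys (strings in double quotes) and values (strings, numbers, nested JSON objects, or arrays), and since it closely resembles a Python dictionary, you can obtain JSON from a pandas DataFrame with a simple conversion and some string manipulation: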

    import pandas as pd
    df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
    
    # convert index values to string (when they're something else - JSON requires strings for keys)
    df.index = df.index.map(str)
    # convert column names to string (when they're something else - JSON requires strings for keys)
    df.columns = df.columns.map(str)
    
    # convert DataFrame to dict, dict to string and simply jsonify quotes from single to double quotes  
    js = str(df.to_dict()).replace("'", '"')
    print(js) # print or write to file or return as REST...anything you want
    

    Output:

    {"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "b"}, "2": {"0": 1, "1": 2}}
    

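    Update: Following @Swier's comment (thanks), strings in the original DataFrame that contain double quotes can cause problems. df.to_json() escapes them (i.e. '"a"' becomes "\"a\"" in the JSON output). A small change to the string handling covers this case as well. Full example: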

    import pandas as pd
    
    def run_jsonifier(df):
        # convert index values to string (when they're something else)
        df.index = df.index.map(str)
        # convert column names to string (when they're something else)
        df.columns = df.columns.map(str)
    
        # convert DataFrame to dict and dict to string
        js = str(df.to_dict())
        # store indices of double quote marks in string for later update
        idx = [i for i, ch in enumerate(js) if ch == '"']
        # jsonify quotes from single to double quotes  
        js = js.replace("'", '"')
        # add \ to original double quotes to make it json-like escape sequence 
        for add, i in enumerate(idx):
            js = js[:i+add] + '\\' + js[i+add:] 
        return js
    
    # define double-quotes-rich dataframe
    df = pd.DataFrame([['τ', '"a"', 1], ['π', 'this" breaks >>"<""< ', 2]])
    
    # run our function to convert dataframe to json
    print(run_jsonifier(df))
    # run original `to_json()` to see difference
    print(df.to_json())
    

    Output:

    {"0": {"0": "τ", "1": "π"}, "1": {"0": "\"a\"", "1": "this\" breaks >>\"<\"\"< "}, "2": {"0": 1, "1": 2}}
    {"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"\"a\"","1":"this\" breaks >>\"<\"\"< "},"2":{"0":1,"1":2}}
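
As an alternative to patching quotes by hand, the dictionary can be passed to the standard library's `json.dumps`, which handles quote escaping itself. This is a different technique from the string replacement above, not the answer author's method. A minimal sketch, assuming the default `to_dict()` layout is what you want; the sample row containing a single quote is made up for illustration:

```python
import json

import pandas as pd

# Sample data with both double and single quotes inside the strings
df = pd.DataFrame([['τ', '"a"', 1], ['π', "it's >>\"<<", 2]])

# JSON object keys must be strings, so convert the labels on a copy
out = df.copy()
out.index = out.index.map(str)
out.columns = out.columns.map(str)

# ensure_ascii=False keeps non-ASCII characters unescaped;
# json.dumps takes care of escaping embedded double quotes
js = json.dumps(out.to_dict(), ensure_ascii=False)
print(js)

# Round-trip check: the string parses back to the same nested dict
parsed = json.loads(js)
```

Because the serializer does the escaping, values containing single quotes, double quotes, or backslashes should not need special handling.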
    

    【Comments】:

    • Converting the result to a string and replacing the quotes produces invalid JSON if any text value contains a quote. pd.DataFrame([['τ', 'a', 1], ['π', 'this breaks >>"<< ', 2]]) yields {"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "this breaks >>"<< "}, "2": {"0": 1, "1": 2}}
    • Thanks @Swier - I have updated the answer to handle this case
    【Solution 2】:

    Opening a file with the encoding set to utf-8 and then passing that file to the .to_json function fixes the problem:

    with open('df.json', 'w', encoding='utf-8') as file:
        df.to_json(file, force_ascii=False)
    

    which gives the correct output:

    {"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
    

    Note: the force_ascii=False argument is still required.
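
To see why both pieces are needed, the sketch below compares the default output with the UTF-8 file and reads the file back. It is only an illustration: the temporary directory is a made-up scratch location, not part of the original answer.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])

# With the default force_ascii=True, non-ASCII characters are escaped
# no matter which encoding the target file uses
escaped = df.to_json()

# Hypothetical scratch location for the demonstration
path = os.path.join(tempfile.mkdtemp(), 'df.json')
with open(path, 'w', encoding='utf-8') as fh:
    df.to_json(fh, force_ascii=False)

# Read the raw bytes back and decode them explicitly as UTF-8
with open(path, 'rb') as fh:
    raw = fh.read()
text = raw.decode('utf-8')
data = json.loads(text)
print(text)
```

When reading such a file later, passing encoding='utf-8' to open() keeps the result independent of the platform default.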
