【Question title】: Parsing a pandas DataFrame column of dictionary-like strings into a new column for each dictionary key
【Posted】: 2020-07-09 17:31:03
【Question】:

In Python 3 with pandas, suppose a DataFrame df has a single column x:

import pandas as pd

df = pd.DataFrame(
[
{'x':'{"a":"1","b":"2","c":"3"}'},
{'x':'{"a":"2","b":"3","c":"4"}'}
]
)

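The values in column x look like dictionaries, but they are stored as strings. How can they be parsed into a new DataFrame so that each key becomes a new column?

The desired output DataFrame looks like this: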

x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4

None of the solutions in the following post seem to work for this case:

parsing a dictionary in a pandas dataframe cell into new row cells (new columns)

df1=pd.DataFrame(df.loc[:,'x'].values.tolist())

print(df1)

This produces the same DataFrame. It does not split the column into one column per key.

Any suggestions? Thanks!

【Question discussion】:

    Tags: python-3.x pandas parsing


    【Solution 1】:

    You can also map json.loads over the column and convert the result to a DataFrame:

    import json
    df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
    print(df1)
    

       a  b  c
    0  1  2  3
    1  2  3  4
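
One detail the printed output hides: because the values are quoted in the JSON strings, the new columns hold the strings "1", "2", and so on, not integers. The following is a minimal, self-contained sketch of the same approach with an optional numeric conversion; the name `df1_numeric` and the use of `pd.to_numeric` are illustrative additions, not part of the original answer.

```python
import json

import pandas as pd

df = pd.DataFrame(
    [
        {"x": '{"a":"1","b":"2","c":"3"}'},
        {"x": '{"a":"2","b":"3","c":"4"}'},
    ]
)

# Parse each JSON string into a dict, then build one column per key.
df1 = pd.DataFrame(df["x"].map(json.loads).tolist(), index=df.index)

# The parsed values are still strings ("1", "2", ...) because they were
# quoted in the JSON. Convert column by column if numbers are wanted
# (assumption: every value in these columns is numeric).
df1_numeric = df1.apply(pd.to_numeric)

print(df1.dtypes)
print(df1_numeric.dtypes)
```

If some values may not be numeric, the conversion step would need its own error handling, which is left out here.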
    

    In this test it is faster than evaluating with ast. Below is a benchmark on 40K rows:

    m = pd.concat([df]*20000,ignore_index=True)
    
    %%timeit
    import json
    df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
    #256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit
    import ast
    df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
    #1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    import ast
    df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
    #1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
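
The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard `timeit` module. The function names and the smaller `copies` value are assumptions chosen to keep the run short; the answer itself used 20000 copies (40K rows).

```python
import ast
import json
import timeit

import pandas as pd

df = pd.DataFrame(
    [
        {"x": '{"a":"1","b":"2","c":"3"}'},
        {"x": '{"a":"2","b":"3","c":"4"}'},
    ]
)

copies = 2000  # assumption: reduced from the 20000 used in the answer
m = pd.concat([df] * copies, ignore_index=True)


def with_json():
    return pd.DataFrame(m["x"].map(json.loads).tolist(), index=m.index)


def with_literal_eval():
    return pd.DataFrame(m["x"].map(ast.literal_eval).tolist(), index=m.index)


# Both parsers should build the same frame for this input.
same_result = with_json().equals(with_literal_eval())

# Best of 3 single runs, in seconds. Absolute numbers depend on the machine.
json_seconds = min(timeit.repeat(with_json, number=1, repeat=3))
ast_seconds = min(timeit.repeat(with_literal_eval, number=1, repeat=3))
print(f"json.loads: {json_seconds:.3f} s, ast.literal_eval: {ast_seconds:.3f} s")
```

Timings measured this way will not match the figures quoted above exactly, since they vary with hardware, library versions, and row count.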
    

    【Comments】:

    • Can you test pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)?
    • @jezrael map is a tad faster here: 312 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    • Hmm, I get %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index): 206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each); %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index): 210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    • So the list comprehension is a bit faster.
    • Thanks for the detailed answer. Really appreciate it. Learned a lot! @anky_91
    【Solution 2】:

    Because the values are string representations of dictionaries, they first need to be converted to dictionaries:

    import ast, json
    
    # timings on repeated sample data; results on real data may differ
    m = pd.concat([df]*20000,ignore_index=True)
    
    In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
    206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    #anky_91 solution    
    In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
    210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
    903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
    893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
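
Apart from speed, the two parsers timed above accept different inputs, which may matter more than the benchmark when choosing between them. The sketch below illustrates the difference on three hypothetical sample strings (the variable names are illustrative): the question's data is valid both as JSON and as a Python literal, but that is not true in general.

```python
import ast
import json

json_text = '{"a":"1","b":"2","c":"3"}'  # valid JSON and a valid Python literal
python_repr = "{'a': '1', 'b': '2'}"      # Python repr with single quotes, not valid JSON
json_only = '{"a": null, "b": true}'      # JSON keywords that are not Python literals

# For data like the question's, both parsers give the same dict.
both_agree = json.loads(json_text) == ast.literal_eval(json_text)

# ast.literal_eval reads the single-quoted repr; json.loads rejects it.
from_repr = ast.literal_eval(python_repr)
try:
    json.loads(python_repr)
    json_accepts_repr = True
except json.JSONDecodeError:
    json_accepts_repr = False

# json.loads maps null/true to None/True; ast.literal_eval rejects them.
from_json = json.loads(json_only)
try:
    ast.literal_eval(json_only)
    ast_accepts_json_keywords = True
except ValueError:
    ast_accepts_json_keywords = False

print(both_agree, json_accepts_repr, ast_accepts_json_keywords)
```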
    

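    # build the result from the sample df, using the list comprehension timed above
    df1 = pd.DataFrame([json.loads(x) for x in df['x']], index=df.index)
    print(df1)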
       a  b  c
    0  1  2  3
    1  2  3  4
    

    Finally, join the result back onto the original DataFrame:

    df = df.join(df1)
    print(df)
                               x  a  b  c
    0  {"a":"1","b":"2","c":"3"}  1  2  3
    1  {"a":"2","b":"3","c":"4"}  2  3  4
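
The parse step and the join step can be wrapped in one small function. This is only a sketch: `expand_json_column` is a hypothetical helper name, and it assumes every value in the column is a valid JSON object string.

```python
import json

import pandas as pd


def expand_json_column(frame, column):
    """Return a new frame with one extra column per key found in `column`."""
    # Parse every JSON string into a dict and keep the original index,
    # so the join below aligns the rows correctly.
    parsed = pd.DataFrame(
        [json.loads(s) for s in frame[column]], index=frame.index
    )
    return frame.join(parsed)


df = pd.DataFrame(
    [
        {"x": '{"a":"1","b":"2","c":"3"}'},
        {"x": '{"a":"2","b":"3","c":"4"}'},
    ]
)

result = expand_json_column(df, "x")
print(result)
```

Note that `DataFrame.join` raises an error if a parsed key has the same name as an existing column, so in that case a suffix or a rename would be needed.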
    
