[Question title]: pandas DataFrame: efficiently expanding a column containing JSON into multiple columns
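[Posted]: 2020-03-30 13:42:13
[Question description]:

I have a DataFrame in which one column holds JSON strings, each encoding a dictionary, and I need to expand that JSON into separate columns. Example: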

   c1                              c2
0  a1     {'x1': 1, 'x3': 3, 'x2': 2}
1  a2  {'x1': 21, 'x3': 23, 'x2': 22}

should become:

   c1    x1    x2    x3
0  a1   1.0   2.0   3.0
1  a2  21.0  22.0  23.0

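My question is very similar to this thread, except that I have strings rather than dictionaries (although the strings evaluate to dictionaries), and the simple, optimized solution proposed there does not work in my case. I have a working solution, but it is clearly very inefficient. Here is a snippet with my code and the solution proposed in that thread: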

import json
import pandas as pd

def expandFeatures(df, columnName):
    """Expands column 'columnName', which contains a dictionary in form of a json string, into N single columns, each containing a single feature"""
    # get names of new columns from the first row
    features = json.loads(df.iloc[0].loc[columnName])
    featureNames = list(features.keys())
    featureNames.sort()
    # add new columns (empty values)
    newCols = list(df.columns) + featureNames
    df = df.reindex(columns=newCols, fill_value=0.0)
    # fill in the values of the new columns
    for index, row in df.iterrows():
        features = json.loads(row[columnName])
        for key,val in features.items():
            df.at[index, key] = val
    # remove column 'columnName'
    return df.drop(columns=[columnName])

def expandFeatures1(df, columnName):
    return df.drop(columnName, axis=1).join(pd.DataFrame(df[columnName].values.tolist()))

df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'], ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                    columns=['c1', 'c2'])
df_dict = pd.DataFrame([['a1', {'x1': 1, 'x2': 2, 'x3': 3}], ['a2', {'x1': 21, 'x2': 22, 'x3': 23}]],
                    columns=['c1', 'c2'])

# correct result, but inefficient
print("expandFeatures, df_json")
df = df_json.copy()
print(df)
df = expandFeatures(df, 'c2')
print(df)

# this gives an error because expandFeatures expects a string, not a dictionary 
# print("expandFeatures, df_dict")
# df = df_dict.copy()
# print(df)
# df = expandFeatures(df, 'c2')
# print(df)

# WRONG, doesn't expand anything
print("expandFeatures1, df_json")
df = df_json.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)

# correct and efficient, but not my use case (I have strings not dicts)
print("expandFeatures1, df_dict")
df = df_dict.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)

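I am sure there is some obvious way to make my code more efficient and closer to the one-liner proposed in the other thread, but I cannot see it myself. Thanks in advance for any help.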

[Question discussion]:

    Tags: json pandas dataframe dictionary


    [Solution 1]:

    If your JSON strings are valid dictionaries, you can use ast.literal_eval to parse them:

    import pandas as pd
    from ast import literal_eval
    
    df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'],
                            ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                            columns=['c1', 'c2'])
    
    print (pd.concat([df_json,pd.DataFrame(df_json["c2"].apply(literal_eval).to_list())],axis=1).drop("c2",axis=1))
    
    #
       c1  x1  x2  x3
    0  a1   1   2   3
    1  a2  21  22  23
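
Since the strings in the question are JSON, the same idea can be sketched with json.loads instead of literal_eval. The helper name `expand_json_column` below is hypothetical and not part of pandas, and the non-default index `[10, 20]` is an assumption added to show that passing `index=df.index` keeps rows aligned when the original index is not 0..n-1:

```python
import json

import pandas as pd


def expand_json_column(df, column_name):
    """Hypothetical helper: expand a column of JSON object strings into columns."""
    # Parse every JSON string into a dict in one pass, without iterrows().
    parsed = df[column_name].map(json.loads)
    # Build the new columns, reusing the original index so rows stay aligned.
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)
    # Drop the JSON column and attach the expanded columns by index.
    return df.drop(columns=[column_name]).join(expanded)


df_json = pd.DataFrame(
    [['a1', '{"x1": 1, "x2": 2, "x3": 3}'],
     ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
    columns=['c1', 'c2'],
    index=[10, 20],  # assumed non-default index, for illustration only
)

result = expand_json_column(df_json, 'c2')
print(result)
```

Without `index=df.index`, the expanded frame would get a fresh 0..n-1 index, and both `join` and `pd.concat(..., axis=1)` would then align on mismatched labels. The sketch also assumes that no JSON key collides with an existing column name, since `join` raises on overlapping columns.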
    

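    [Comments]:

    • Thanks @henry-yik, this works. Is there a particular reason you used pd.concat? Is it more efficient than pd.join?
    • Also, since my strings are definitely JSON, I could use apply(json.loads) as well. What is the difference between the two? Is one more efficient than the other, or are they essentially the same internally?
    • concat is axis-based, which suits our purpose since we don't need to do a lookup. I don't know the performance difference between ast and json.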
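
On the question of how the two parsers differ, one behavioral difference (separate from speed, which is not measured here) is that ast.literal_eval parses Python literals while json.loads parses JSON. A minimal sketch, using a made-up string that contains the JSON-only literals `true` and `null`:

```python
import json
from ast import literal_eval

# Hypothetical input: valid JSON, but not a valid Python literal.
text = '{"x1": 1, "flag": true, "missing": null}'

# json.loads maps true/null to Python's True/None.
parsed = json.loads(text)

# literal_eval rejects the bare names true and null.
try:
    literal_eval(text)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False

print(parsed)
print(literal_eval_ok)
```

The reverse also holds: a single-quoted string such as `"{'x1': 1}"` is a valid Python literal but not valid JSON, so json.loads rejects it. For strings known to be JSON, json.loads is therefore the closer match.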