Converting a DataFrame with multiple values in one cell: answers

【Question title】: Converting Dataframe with multiple values in one cell
【Posted】: 2021-04-23 06:53:06
【Question description】:

I have a dataframe that looks like this:

id                          value       index
5eb3cbcc434474213e58b49a    [1,2,3,4,6] [0,1,2,3,4]
5eb3f335434474213e58b49d    [1,2,3,4]   [0,2,3,4]
5eb3f853434474213e58b49f    [1,2,3,4]   [0,2,3,4]
5eb40395434474213e58b4a2    [1,2,3,4]   [0,1,2,3]
5eb40425434474213e58b4a5    [1,2]       [0,2]

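I am trying to convert this dataframe into the form below, where each entry in index is meant to become the column header for the corresponding entry in value: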

id                          0   1   2   3   4
5eb3cbcc434474213e58b49a    1   2   3   4   6
5eb3f335434474213e58b49d    1   NaN 2   3   4
5eb3f853434474213e58b49f    1   NaN 2   3   4
5eb40395434474213e58b4a2    1   2   3   4   NaN
5eb40425434474213e58b4a5    1   NaN 2   NaN NaN

I first tried to split the list of lists:

new_df = pd.DataFrame(df.Value.str.split(',').tolist(), index=df.Index).stack()
new_df = new_df.reset_index([0, 'Index'])
new_df.columns = ['Value', 'Index']

But I got this error:

TypeError: unhashable type: 'list'

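What is causing this error?

【Question comments】:

Do the answers below meet your requirements? If so, please review them and accept one. This is how StackOverflow helps other users with similar problems find a solution. Thanks! :-)

Tags: python pandas dataframe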

【Solution 1】:

.apply() can be used together with pd.Series(), as follows:

df = df.set_index('id').apply(lambda x: pd.Series(x['value'], index=x['index']), axis=1).reset_index()


print(df)

                         id    0    1    2    3    4
0  5eb3cbcc434474213e58b49a  1.0  2.0  3.0  4.0  6.0
1  5eb3f335434474213e58b49d  1.0  NaN  2.0  3.0  4.0
2  5eb3f853434474213e58b49f  1.0  NaN  2.0  3.0  4.0
3  5eb40395434474213e58b4a2  1.0  2.0  3.0  4.0  NaN
4  5eb40425434474213e58b4a5  1.0  NaN  2.0  NaN  NaN

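This takes advantage of a documented feature of .apply():

"The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns."

This feature is very handy for problems that require expanding data into columns. Because the existing row index is kept and carried over to the new columns, the new columns line up with the existing data. I used it to give a simple answer to a classic question: How to merge a Series and DataFrame.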
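To make the quoted behaviour concrete, here is a minimal sketch that contrasts returning a plain list with returning a Series from the applied function. The two-row frame and the names as_lists and expanded are made up for illustration; only the column names value and index come from the question.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the one in the question
df = pd.DataFrame({
    "id": ["a", "b"],
    "value": [[1, 2, 3], [1, 2]],
    "index": [[0, 1, 2], [0, 2]],
})

# Returning a plain list: the result stays a Series whose cells are lists
as_lists = df.apply(lambda row: row["value"], axis=1)

# Returning a Series: its labels become columns, and labels that a row
# does not have are filled with NaN
expanded = df.apply(lambda row: pd.Series(row["value"], index=row["index"]), axis=1)

print(type(as_lists).__name__)
print(expanded)
```

Note that row["index"] looks up the cell labelled "index" in the row, which is not the same thing as the row.index attribute.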

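【Comments】:

Hmm, performance-wise I think this is quite bad.
@jezrael You are right, this is fine for small datasets.
By the way, I am really surprised by such a difference. Interesting.

【Solution 2】:

If you want to avoid a slow solution, create a list of dictionaries in a list comprehension, pass it to the DataFrame constructor, and finally join the result to the original: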

L = [dict(zip(x, y)) for x, y in zip(df.pop('index'), df.pop('value'))]

df = df.join(pd.DataFrame(L, index=df.index))
print (df)
                         id  0    1  2    3    4
0  5eb3cbcc434474213e58b49a  1  2.0  3  4.0  6.0
1  5eb3f335434474213e58b49d  1  NaN  2  3.0  4.0
2  5eb3f853434474213e58b49f  1  NaN  2  3.0  4.0
3  5eb40395434474213e58b4a2  1  2.0  3  4.0  NaN
4  5eb40425434474213e58b4a5  1  NaN  2  NaN  NaN
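The intermediate list of dictionaries is the key step, so a small sketch may help show what it contains. The two-row frame and the names records, wide and result are illustrative assumptions. The sketch uses drop instead of pop so that the input frame is left unchanged.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the one in the question
df = pd.DataFrame({
    "id": ["a", "b"],
    "value": [[1, 2, 3], [1, 2]],
    "index": [[0, 1, 2], [0, 2]],
})

# One dict per row, mapping each index label to its value
records = [dict(zip(idx, val)) for idx, val in zip(df["index"], df["value"])]
print(records)  # [{0: 1, 1: 2, 2: 3}, {0: 1, 2: 2}]

# The constructor aligns the dict keys into columns; missing keys become NaN
wide = pd.DataFrame(records, index=df.index)

# Attach the new columns to the remaining original columns by row index
result = df.drop(columns=["index", "value"]).join(wide)
print(result)
```

The column order follows the order in which keys first appear, so if the first row lacked some labels the columns could come out unsorted.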

Performance:

#5k rows
df = pd.concat([df] * 1000, ignore_index=True)


In [123]: %timeit df.set_index('id').apply(lambda x: pd.Series(x['value'], index=x['index']), axis=1).reset_index()
2.14 s ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#equivalent code without pop, because pop removes the columns from df and so fails on repeated timing runs
In [124]: %timeit df.drop(['index','value'], axis=1).join(pd.DataFrame([dict(zip(x, y)) for x, y in zip(df['index'], df['value'])], index=df.index))
15.2 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#50k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [126]: %timeit df.set_index('id').apply(lambda x: pd.Series(x['value'], index=x['index']), axis=1).reset_index()
24.2 s ± 1.14 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [127]: %timeit df.drop(['index','value'], axis=1).join(pd.DataFrame([dict(zip(x, y)) for x, y in zip(df['index'], df['value'])], index=df.index))
128 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
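For readers who want to repeat the comparison outside IPython, the following is a rough sketch using time.perf_counter instead of %timeit. The sample data, the repeat count of 200, and the helper names with_apply, with_dicts and timed are assumptions made for this example. A single timed run is much noisier than %timeit, so treat the numbers as indicative only.

```python
import time
import pandas as pd

# Hypothetical sample data, repeated to get a few hundred rows
base = pd.DataFrame({
    "id": ["a", "b", "c"],
    "value": [[1, 2, 3], [1, 2], [5]],
    "index": [[0, 1, 2], [0, 2], [1]],
})
big = pd.concat([base] * 200, ignore_index=True)

def with_apply(df):
    # Solution 1: build one Series per row and let apply expand it
    return (df.set_index("id")
              .apply(lambda row: pd.Series(row["value"], index=row["index"]), axis=1)
              .reset_index())

def with_dicts(df):
    # Solution 2: build plain dicts and call the DataFrame constructor once
    records = [dict(zip(idx, val)) for idx, val in zip(df["index"], df["value"])]
    return df.drop(columns=["index", "value"]).join(pd.DataFrame(records, index=df.index))

def timed(func, df):
    # Run func once and return its result with the elapsed seconds
    start = time.perf_counter()
    result = func(df)
    return result, time.perf_counter() - start

apply_result, apply_seconds = timed(with_apply, big)
dict_result, dict_seconds = timed(with_dicts, big)

# Both approaches should produce the same values (dtypes may differ)
pd.testing.assert_frame_equal(apply_result, dict_result, check_dtype=False)
print(f"apply: {apply_seconds:.4f} s, dicts: {dict_seconds:.4f} s")
```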

【Comments】: