Unmelting a pandas dataframe in Python? Answers

【Question title】: Unmelting a pandas dataframe in Python?
【Posted】: 2013-02-09 08:01:16
【Question】:

I melted a pandas dataframe for plotting with ggplot (which usually requires long-format dataframes), like this:

test = pandas.melt(iris, id_vars=["Name"], value_vars=["SepalLength", "SepalWidth"])

This keeps the Name field of the iris dataset as an identifier column, but converts the columns SepalLength and SepalWidth into long format:

test.loc[0:10]  # .ix in the original post; .ix has since been removed from pandas
Out:
           Name     variable  value
0   Iris-setosa  SepalLength    5.1
1   Iris-setosa  SepalLength    4.9
2   Iris-setosa  SepalLength    4.7
3   Iris-setosa  SepalLength    4.6
4   Iris-setosa  SepalLength    5.0
5   Iris-setosa  SepalLength    5.4
6   Iris-setosa  SepalLength    4.6
7   Iris-setosa  SepalLength    5.0
8   Iris-setosa  SepalLength    4.4
9   Iris-setosa  SepalLength    4.9
10  Iris-setosa  SepalLength    5.4
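
For readers without the iris data at hand, the reshaping described above can be reproduced on a tiny frame. This is only a sketch: `iris_small` and its three rows are made-up stand-ins, not the real data set.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris data set.
iris_small = pd.DataFrame({
    "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "SepalLength": [5.1, 4.9, 6.3],
    "SepalWidth": [3.5, 3.0, 3.3],
})

long_df = pd.melt(iris_small, id_vars=["Name"],
                  value_vars=["SepalLength", "SepalWidth"])

# One output row per (input row, value column) pair. All rows for the first
# value column come before all rows for the second, and the result gets a
# fresh default index.
print(long_df)
```

Note that nothing in `long_df` records which original row a value came from, which is what makes the reverse operation awkward.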

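How can I "unmelt" this dataframe? I want to keep the Name column but turn the values of the variable field into separate columns. The Name field is not unique, so I don't think it can be used as the index. My impression was that pivot is the right function for this, but the following call fails: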

test.pivot(columns="variable", values="value")
KeyError: u'no item named '

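How can I do this? Also, can I unmelt dataframes that have multiple long-format columns, i.e. where several columns in test play the role of the variable column above? That would seem to mean that columns has to accept a list of columns rather than a single value. Thanks.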

【Question discussion】:

Tags: python numpy pandas dataform dataframe

【Solution 1】:

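I think this situation is ambiguous because the test dataframe has no index that identifies which original row each record came from. If melt simply stacks the rows for the value_vars SepalLength and SepalWidth on top of each other, then you can manually create an index to pivot on; the result looks the same as the original: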

In [15]: test['index'] = list(range(len(test) // 2)) * 2  # Python 3 form of range(len(test) / 2) * 2
In [16]: test[:10]
Out[16]: 
          Name     variable  value  index
0  Iris-setosa  SepalLength    5.1      0
1  Iris-setosa  SepalLength    4.9      1
2  Iris-setosa  SepalLength    4.7      2
3  Iris-setosa  SepalLength    4.6      3
4  Iris-setosa  SepalLength    5.0      4
5  Iris-setosa  SepalLength    5.4      5
6  Iris-setosa  SepalLength    4.6      6
7  Iris-setosa  SepalLength    5.0      7
8  Iris-setosa  SepalLength    4.4      8
9  Iris-setosa  SepalLength    4.9      9

In [17]: test[-10:]
Out[17]: 
               Name    variable  value  index
290  Iris-virginica  SepalWidth    3.1    140
291  Iris-virginica  SepalWidth    3.1    141
292  Iris-virginica  SepalWidth    2.7    142
293  Iris-virginica  SepalWidth    3.2    143
294  Iris-virginica  SepalWidth    3.3    144
295  Iris-virginica  SepalWidth    3.0    145
296  Iris-virginica  SepalWidth    2.5    146
297  Iris-virginica  SepalWidth    3.0    147
298  Iris-virginica  SepalWidth    3.4    148
299  Iris-virginica  SepalWidth    3.0    149

In [18]: df = test.pivot(index='index', columns='variable', values='value')
In [19]: df['Name'] = test['Name']
In [20]: df[:10]
Out[20]: 
variable  SepalLength  SepalWidth         Name
index                                         
0                 5.1         3.5  Iris-setosa
1                 4.9         3.0  Iris-setosa
2                 4.7         3.2  Iris-setosa
3                 4.6         3.1  Iris-setosa
4                 5.0         3.6  Iris-setosa
5                 5.4         3.9  Iris-setosa
6                 4.6         3.4  Iris-setosa
7                 5.0         3.4  Iris-setosa
8                 4.4         2.9  Iris-setosa
9                 4.9         3.1  Iris-setosa

In [21]: (iris[["SepalLength", "SepalWidth", "Name"]] == df[["SepalLength", "SepalWidth", "Name"]]).all()
Out[21]: 
SepalLength    True
SepalWidth     True
Name           True
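
The session above depends on the full iris frame. As a rough, self-contained sketch of the same idea for Python 3, the round trip can be tried on a small made-up frame. `iris_small`, `n_value_vars` and `wide` are illustrative names, and the approach assumes melt stacks the value columns in equal-sized blocks in the original row order.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris data set.
iris_small = pd.DataFrame({
    "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "SepalLength": [5.1, 4.9, 6.3],
    "SepalWidth": [3.5, 3.0, 3.3],
})
test = pd.melt(iris_small, id_vars=["Name"],
               value_vars=["SepalLength", "SepalWidth"])

# Rebuild the original row number: 0..n-1 repeated once per melted column.
n_value_vars = 2
n_rows = len(test) // n_value_vars
test["index"] = list(range(n_rows)) * n_value_vars

# Pivot back to wide format on the rebuilt row number.
wide = test.pivot(index="index", columns="variable", values="value")

# Assignment aligns on index labels, so labels 0..n-1 of `test`
# (the first block) supply the Name for each row of `wide`.
wide["Name"] = test["Name"]
print(wide)
```

The Name assignment only works because the first block of `test` carries the labels 0..n-1 in the original order; with a shuffled long frame this would need a different join.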

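【Discussion】:

I am confused by your index column. First, doesn't test already have a unique index identifying each row, namely the default index? Also, what is the purpose of dividing by 2 before taking the range and then multiplying by 2? Why can't you do test['index'] = list(test.index), or something similar, to create an arbitrary unique index for each row?
len(iris) == 150 and len(test) == 300. The original index on test has a unique value for every row in test, but not for every row in the original iris dataframe. My code range(len(test) / 2) * 2 (Python 2; in Python 3 that is list(range(len(test) // 2)) * 2) is the list [0..149] concatenated with itself, as can be seen in the output of test[-10:] (the original index and the new index do not match).
test (length 300) is twice as long as the original iris (length 150); the first 150 rows of test contain only the SepalLength values and the next 150 rows contain only the SepalWidth values. So I built an index that runs through [0..149] twice. Does that make sense?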
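
As a side note to the explanation above, the "0..149 twice" index can also be derived from the data instead of being hard-coded for two value columns. The sketch below swaps in a different technique from the one in the answer: it numbers the rows within each variable group using groupby(...).cumcount(). The names `iris_small`, `row_id` and `wide` are made up for illustration, and it still assumes every variable block keeps the original row order.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris data set.
iris_small = pd.DataFrame({
    "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "SepalLength": [5.1, 4.9, 6.3],
    "SepalWidth": [3.5, 3.0, 3.3],
})
test = pd.melt(iris_small, id_vars=["Name"],
               value_vars=["SepalLength", "SepalWidth"])

# Number the rows inside each `variable` block: 0, 1, 2, ... per block.
# This works for any number of melted columns, not just two.
test["row_id"] = test.groupby("variable").cumcount()

# Move `variable` from the row index into the columns, keeping Name.
wide = (test.set_index(["row_id", "Name", "variable"])["value"]
            .unstack("variable")
            .reset_index("Name"))
print(wide)
```

Because Name is part of the key here, it is carried along by the reshape instead of being re-attached afterwards.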