【问题标题】:Creating new pandas dataframe based on another dataframe基于另一个数据框创建新的熊猫数据框
【发布时间】:2020-09-01 15:17:14
【问题描述】:

编辑:

pandas 1.0.5有一个bug,升级到1.1.1后就没有错误了。

我有一个看起来像这样的 pandas 数据框:

   Name      Date      Price      Label   Y      Z
   foo1     1/1/20      100       1       _      _
   foo1     1/1/20      200       2       _      _
    .       .           .         .       .      .
    .       .           .         .       .      .
   foo1     1/8/20      240       1       _      _
   foo2     1/2/20      500       1       _      _
    .       .           .         .       .      .
    .       .           .         .       .      .
   foo2     1/7/20      423       4       _      _
    .       .           .         .       .      .
    .       .           .         .       .      .
  • Name 列有 80 个唯一值,即 foo1 - foo80
  • 有 20 个唯一的 Date
  • 有 4 个唯一的 Label
  • Y 和 Z 列与新数据框无关

我想创建一个表 s.t,它将有 80 行(每个名称为每个名称)和 20*4 + 1 列(每个日期标签组合为 20x4,名称为 1)。

最终的数据框应如下所示:

**Name 1/1/20(Label1)  1/1/20(Label2)  1/1/20(Label3)  1/1/20(Label4)  1/2/20(Label1)    ...    4/7/20(Label4)**
 foo1    100             200              300             -1              -1                        -1
 foo2    -1               -1               -1             -1              500                       -1
...............
...............

-1表示原始条目中没有特定Name-Date-Label组合的条目。

我基本上是 pandas 的新手,我当然可以手动迭代构建数据框(if..else 解决方案),但我相信有一个更快、可读、更简单的解决方案。

> df.columns
Index(['A', 'B', 'Date', 'C',
   'D', 'Price', 'Label', 'E',
   'Name', 'F', 'G', 'H', 'I',
   'J'],
  dtype='object')

> df.head(10).to_dict('list')

{'A': [160, 457, 457, 482, 482, 482, 482, 423, 223, 506],
'B': ['8/27/2015 0:00',
'10/15/2015 0:00',
'10/15/2015 0:00',
'10/28/2015 0:00',
'10/28/2015 0:00',
'10/28/2015 0:00',
'10/28/2015 0:00',
'9/29/2015 0:00',
'9/9/2015 0:00',
'11/9/2015 0:00'],
'Date': ['8/28/2015 0:00',
'10/16/2015 0:00',
'10/16/2015 0:00',
'10/29/2015 0:00',
'10/29/2015 0:00',
'10/29/2015 0:00',
'10/29/2015 0:00',
'9/30/2015 0:00',
'9/10/2015 0:00',
'11/10/2015 0:00'],
'C': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
'D': [1271, 1825, 1825, 1455, 1455, 1455, 1455, 2522, 1385, 1765],
'Price': [1058, 1685, 1615, 1195, 1255, 1279, 1295, 2285, 1285, 1665],
'Label': [3, 3, 2, 1, 3, 4, 2, 2, 1, 4],
'E': [13, 127, 127, -1, -1, -1, -1, -1, -1, -1],
'Name': ['foo1',
'foo2',
'foo2',
'foo3',
'foo3',
'foo3',
'foo3',
'foo4',
'foo4',
'foo3'],
'F': [4, 4, 4, 3, 3, 3, 3, 3, 3, 3],
'G': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'H': ['Friday',
'Friday',
'Friday',
'Thursday',
'Thursday',
'Thursday',
'Thursday',
'Wednesday',
'Thursday',
'Tuesday'],
'I': [213, 140, 210, 260, 200, 176, 160, 237, 100, 100],
'J': [16.758457907159716,
7.671232876712329,
11.506849315068493,
17.869415807560138,
13.745704467353955,
12.096219931271474,
10.996563573883162,
9.397303727200637,
7.220216606498194,
5.6657223796034]}

使用

import pandas_datareader.data as web
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({
    # 'A': [160, 457, 457, 482, 482, 482, 482, 423, 223, 506],
    # 'B': ['8/27/2015 0:00','10/15/2015 0:00','10/15/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','9/29/2015 0:00','9/9/2015 0:00','11/9/2015 0:00'],
    'Date': ['8/28/2015 0:00','10/16/2015 0:00','10/16/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','9/30/2015 0:00','9/10/2015 0:00','11/10/2015 0:00'],
    # 'C': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    # 'D': [1271, 1825, 1825, 1455, 1455, 1455, 1455, 2522, 1385, 1765],
    'Price': [1058, 1685, 1615, 1195, 1255, 1279, 1295, 2285, 1285, 1665],
    'Label': [3, 3, 2, 1, 3, 4, 2, 2, 1, 4],
    # 'E': [13, 127, 127, -1, -1, -1, -1, -1, -1, -1],
    'Name': ['foo1','foo2','foo2','foo3','foo3','foo3','foo3','foo4','foo4','foo3'],
    # 'F': [4, 4, 4, 3, 3, 3, 3, 3, 3, 3],
    # 'G': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    # 'H': ['Friday','Friday','Friday','Thursday','Thursday','Thursday','Thursday','Wednesday','Thursday','Tuesday'],
    # 'I': [213, 140, 210, 260, 200, 176, 160, 237, 100, 100],
    # 'J': [16.758457907159716,7.671232876712329,11.506849315068493,17.869415807560138,13.745704467353955,12.096219931271474,10.996563573883162,9.397303727200637,7.220216606498194,5.6657223796034]
})
df.pivot(index='Name', columns=['Date', 'Label'], values='Price')`

我明白了

ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_level_number(self, level)
   1294         try:
-> 1295             level = self.names.index(level)
   1296         except ValueError:

ValueError: 'Date' is not in list

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-17-542e5c02777d> in <module>
----> 1 df.pivot(index='Name', columns=['Date', 'Label'], values='Price')

~\anaconda3\lib\site-packages\pandas\core\frame.py in pivot(self, index, columns, values)
   5921         from pandas.core.reshape.pivot import pivot
   5922 
-> 5923         return pivot(self, index=index, columns=columns, values=values)
   5924 
   5925     _shared_docs[

~\anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot(data, index, columns, values)
    448         else:
    449             indexed = data._constructor_sliced(data[values].values, index=index)
--> 450     return indexed.unstack(columns)
    451 
    452 

~\anaconda3\lib\site-packages\pandas\core\series.py in unstack(self, level, fill_value)
   3548         from pandas.core.reshape.reshape import unstack
   3549 
-> 3550         return unstack(self, level, fill_value)
   3551 
   3552     # ----------------------------------------------------------------------

~\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in unstack(obj, level, fill_value)
    396             # _unstack_multiple only handles MultiIndexes,
    397             # and isn't needed for a single level
--> 398             return _unstack_multiple(obj, level, fill_value=fill_value)
    399         else:
    400             level = level[0]

~\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in _unstack_multiple(data, clocs, fill_value)
    318     index = data.index
    319 
--> 320     clocs = [index._get_level_number(i) for i in clocs]
    321 
    322     rlocs = [i for i in range(index.nlevels) if i not in clocs]

~\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in <listcomp>(.0)
    318     index = data.index
    319 
--> 320     clocs = [index._get_level_number(i) for i in clocs]
    321 
    322     rlocs = [i for i in range(index.nlevels) if i not in clocs]

~\anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_level_number(self, level)
   1296         except ValueError:
   1297             if not is_integer(level):
-> 1298                 raise KeyError(f"Level {level} not found")
   1299             elif level < 0:
   1300                 level += self.nlevels

KeyError: 'Level Date not found'

谢谢!

【问题讨论】:

  • 这能回答你的问题吗? How to pivot a dataframe
  • 你在用什么pd.__version__?这看起来像一个错误
  • 使用版本 1.0.5,我应该'conda update pandas'吗?
  • 升级到1.1.1,现在可以使用了!谢谢!

标签: python pandas dataframe


【解决方案1】:

您正在寻找df.pivot

df = df.pivot(index='Name', columns=['Date', 'Label'], values='Price')

警告:如果任何名称-日期-标签组合重复(即出现在多行中),则会引发错误。使用pivot_table 或更好的groupby + unstack

如果NameDateLabel 在索引中,则使用unstack 而不是pivot


使用示例数据更新

df = pd.DataFrame({
    # 'A': [160, 457, 457, 482, 482, 482, 482, 423, 223, 506],
    # 'B': ['8/27/2015 0:00','10/15/2015 0:00','10/15/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','10/28/2015 0:00','9/29/2015 0:00','9/9/2015 0:00','11/9/2015 0:00'],
    'Date': ['8/28/2015 0:00','10/16/2015 0:00','10/16/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','10/29/2015 0:00','9/30/2015 0:00','9/10/2015 0:00','11/10/2015 0:00'],
    # 'C': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    # 'D': [1271, 1825, 1825, 1455, 1455, 1455, 1455, 2522, 1385, 1765],
    'Price': [1058, 1685, 1615, 1195, 1255, 1279, 1295, 2285, 1285, 1665],
    'Label': [3, 3, 2, 1, 3, 4, 2, 2, 1, 4],
    # 'E': [13, 127, 127, -1, -1, -1, -1, -1, -1, -1],
    'Name': ['foo1','foo2','foo2','foo3','foo3','foo3','foo3','foo4','foo4','foo3'],
    # 'F': [4, 4, 4, 3, 3, 3, 3, 3, 3, 3],
    # 'G': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    # 'H': ['Friday','Friday','Friday','Thursday','Thursday','Thursday','Thursday','Wednesday','Thursday','Tuesday'],
    # 'I': [213, 140, 210, 260, 200, 176, 160, 237, 100, 100],
    # 'J': [16.758457907159716,7.671232876712329,11.506849315068493,17.869415807560138,13.745704467353955,12.096219931271474,10.996563573883162,9.397303727200637,7.220216606498194,5.6657223796034]
})
df.Date = pd.to_datetime(df.Date)
df = df.pivot(index='Name', columns=['Date', 'Label'], values='Price')
df = df.fillna(-1)
print(df)

输出

Date  2015-08-28 2015-10-16         2015-10-29  ...         2015-09-30 2015-09-10 2015-11-10
Label          3          3       2          1  ...       2          2          1          4
Name                                            ...
foo1      1058.0        NaN     NaN        NaN  ...     NaN        NaN        NaN        NaN
foo2         NaN     1685.0  1615.0        NaN  ...     NaN        NaN        NaN        NaN
foo3         NaN        NaN     NaN     1195.0  ...  1295.0        NaN        NaN     1665.0
foo4         NaN        NaN     NaN        NaN  ...     NaN     2285.0     1285.0        NaN

[4 rows x 10 columns]

【讨论】:

  • 谢谢,我会更深入地了解一下旋转。关于您的回答,我得到“ValueError:'Date'不在列表中”。这两列都不在索引中。
  • 完成了,猜猜它们在索引中?对不起,我错过了解释索引的含义。告诉你,我是熊猫新手:D
  • pandas 数据帧在行和列中都有索引,我们通常所说的“索引”实际上是“行索引”
  • @PhilipL 嗯.. 这行代码是否工作df.pivot(index='Name', columns=['Date', 'Label'], values='Price')
  • @PhilipL 请编辑您的问题并显示完整的回溯
猜你喜欢
  • 2019-02-02
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2021-08-27
  • 2018-05-20
  • 2017-01-01
  • 1970-01-01
相关资源
最近更新 更多