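【Posted】: 2017-12-06 20:37:37
【Question】:
This question is about a function that behaves unexpectedly when applied to two different dataframes, or more precisely, to two groupby objects. Either I am missing something obvious, or there is a bug in pandas.
I wrote the function below to append a row to each group in a groupby object. "This question" (linked in the original post) is another question related to the same function.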
def myfunction(g, now):
    '''This function appends a row to each group and populates the DTM column value of that row with the current timestamp. Other columns of the new row will have NaNs.
    g: one group of a groupby object (apply passes it in as a DataFrame)
    now: current timestamp
    returns a dataframe that has the current timestamp appended in the DTM column for each group
    '''
    g.loc[g.shape[0], 'DTM'] = now # Appending the current timestamp to a DTM column in each group
    return g
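The key line in the function relies on label-based assignment with .loc. As far as I understand pandas, .loc[label, column] = value adds a new row only when the label is absent from the index; when the label is already present, the existing row is written to instead. The following is a minimal sketch of that behavior on two toy frames (the names fresh and middle and their contents are made up for illustration, they are not from the post):

```python
import pandas as pd

# Toy frame whose index labels are 0 and 1, like the first group after reset_index().
fresh = pd.DataFrame({"DTM": ["a", "b"]}, index=[0, 1])
fresh.loc[fresh.shape[0], "DTM"] = "now"    # label 2 is absent, so a new row is added
print(fresh)

# Toy frame whose index labels are 2 and 3, like a group taken from further down the frame.
middle = pd.DataFrame({"DTM": ["a", "b"]}, index=[2, 3])
middle.loc[middle.shape[0], "DTM"] = "now"  # label 2 is present, so that row is overwritten
print(middle)
```

In other words, g.shape[0] is a row count, but .loc treats it as an index label, so whether a row is appended depends on the labels the group happens to carry.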
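We will run two tests to exercise the function.
Test 1
It works as expected on the dataframe a from the linked question (as demonstrated there). For clarity, here is a slightly extended re-run (mostly copied from the linked question).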
import numpy as np
import pandas as pd

arrays = [['bar', 'bar','bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one','two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index = index)
a
Out[50]:
                    0
first second
bar   one    0.134379
      one    0.967928
      two    0.067502
baz   one    0.182887
      two    0.926932
foo   one    0.806225
      two    0.718322
      two    0.932114
qux   one    0.772494
      two    0.141510
Applying the function,
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
This adds a new row to each group. A new DTM column is created because it was not in the original a. The groups are first-second pairs.
a
Out[52]:
               first second         0                     DTM
first second
bar   one    0   bar    one  0.134379                     NaT
             1   bar    one  0.967928                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    2   bar    two  0.067502                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
baz   one    3   baz    one  0.182887                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    4   baz    two  0.926932                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
foo   one    5   foo    one  0.806225                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    6   foo    two  0.718322                     NaT
             7   foo    two  0.932114                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
qux   one    8   qux    one  0.772494                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    9   qux    two  0.141510                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
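The unnamed third index level in this output looks like the row labels each group carried into the function (the integer index left by reset_index(), plus the label of the appended row). My understanding is that when the function passed to apply returns a frame whose index differs from the group's input, pandas prepends the group keys and keeps the returned index as the innermost level. A small sketch under that assumption, using a made-up frame toy and a hypothetical helper add_marker_row:

```python
import pandas as pd

toy = pd.DataFrame({"k": ["x", "x", "y"], "v": [1, 2, 3]})

def add_marker_row(g):
    # Append one extra row under the arbitrary label 99 (chosen for illustration only).
    extra = pd.DataFrame({"v": [0]}, index=[99])
    return pd.concat([g, extra])

# Select the value column explicitly so the function only sees 'v'.
out = toy.groupby("k")[["v"]].apply(add_marker_row)
print(out)
print(out.index.nlevels)                      # group key plus the per-group row labels
print(list(out.index.get_level_values(1)))    # original labels 0, 1, 2 and the added 99s
```

The innermost level here mixes the original row labels with the label of the added row, which matches the pattern of the third level shown above.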
Some cleanup,
a = a.reset_index(level = 2).drop(['level_2', 'first', 'second'], axis = 1).loc[:, [0, 'DTM']]
This gives the final a,
a
Out[62]:
                    0                     DTM
first second
bar   one    0.371683                     NaT
      one    0.327870                     NaT
      one         NaN 2017-07-03 18:56:33.183
      two    0.048794                     NaT
      two         NaN 2017-07-03 18:56:33.183
baz   one    0.462747                     NaT
      one         NaN 2017-07-03 18:56:33.183
      two    0.758674                     NaT
      two         NaN 2017-07-03 18:56:33.183
foo   one    0.238607                     NaT
      one         NaN 2017-07-03 18:56:33.183
      two    0.156104                     NaT
      two    0.594270                     NaT
      two         NaN 2017-07-03 18:56:33.183
qux   one    0.091088                     NaT
      one         NaN 2017-07-03 18:56:33.183
      two    0.795864                     NaT
      two         NaN 2017-07-03 18:56:33.183
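So far so good. This is the expected behavior: every first-second pair has a new row appended, and the DTM column of that row is populated with the current timestamp.
Test 2
Surprisingly, I cannot reproduce this behavior on the following dataframe df. The groups are ID-SEQ combinations.
This df can be reproduced as follows:
1. Copy the following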
C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
C1  587  5/9/2017 10:20  PE
C1  587  5/9/2017 12:25  OK
C1  590  5/9/2017 10:21  PE
C1  590  5/9/2017 13:09  OK
C1  604  5/9/2017 10:38  PE
C1  604  5/9/2017 12:32  OK
C1  609  5/9/2017 10:39  PE
C1  609  5/9/2017 13:29  OK
C1  613  5/9/2017 10:39  PE
C1  613  5/9/2017 13:08  OK
C1  618  5/9/2017 10:40  PE
C1  618  5/9/2017 13:33  OK
C1  636  5/9/2017 10:54  PE
C1  636  5/9/2017 13:36  OK
C1  642  5/9/2017 10:55  PE
C1  642  5/9/2017 13:35  OK
C1  643  5/9/2017 10:56  PE
C1  643  5/9/2017 13:34  OK
C1  656  5/9/2017 10:55  PE
C1  656  5/9/2017 13:36  OK
C2  86  9/5/2016 19:45  PE
C2  86  9/6/2016 11:55  OK
C3  10  4/17/2017 12:23  PE
C3  10  4/17/2017 14:51  OK
C4  38  3/25/2017 10:35  PE
C4  38  3/25/2017 10:51  OK
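2. Then run these,
df = pd.read_clipboard(sep = '[ ]{2,}', header = None) # fields are separated by two or more spaces; header = None because the copied text has no header row
df.columns = ['ID', 'SEQ', 'DTM', 'STATUS']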
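For anyone who prefers not to go through the clipboard, the same kind of frame can be built from a string. This is only a sketch: it uses the first six rows of the data above, assumes the fields are separated by two spaces, and the variable name raw is mine.

```python
import io
import pandas as pd

# First six rows of the sample data; fields are separated by two spaces so that
# the single space inside the date-time field is not treated as a separator.
raw = """C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
C1  587  5/9/2017 10:20  PE
C1  587  5/9/2017 12:25  OK
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"[ ]{2,}",      # two or more spaces
    engine="python",     # regex separators need the Python parser
    header=None,         # the text has no header row
    names=["ID", "SEQ", "DTM", "STATUS"],
)
print(df)
```

DTM is left as plain strings here, as in the outputs shown in this question.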
Setting the MultiIndex,
d = df.set_index(['ID', 'SEQ', 'DTM']) # I have three index levels this time in the original dataframe
This is what d looks like,
d
Out[40]:
                       STATUS
ID SEQ DTM
C1 572 5/9/2017 10:13      PE
       5/9/2017 12:24      OK
   579 5/9/2017 10:19      PE
       5/9/2017 13:25      OK
   587 5/9/2017 10:20      PE
       5/9/2017 12:25      OK
   590 5/9/2017 10:21      PE
       5/9/2017 13:09      OK
   604 5/9/2017 10:38      PE
       5/9/2017 12:32      OK
   609 5/9/2017 10:39      PE
       5/9/2017 13:29      OK
   613 5/9/2017 10:39      PE
       5/9/2017 13:08      OK
   618 5/9/2017 10:40      PE
       5/9/2017 13:33      OK
   636 5/9/2017 10:54      PE
       5/9/2017 13:36      OK
   642 5/9/2017 10:55      PE
       5/9/2017 13:35      OK
   643 5/9/2017 10:56      PE
       5/9/2017 13:34      OK
   656 5/9/2017 10:55      PE
       5/9/2017 13:36      OK
C2 86  9/5/2016 19:45      PE
       9/6/2016 11:55      OK
C3 10  4/17/2017 12:23     PE
       4/17/2017 14:51     OK
C4 38  3/25/2017 10:35     PE
       3/25/2017 10:51     OK
Applying the function,
dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction(x, now)) # a group is a unique combination of ID-SEQ pairs
returns the following (note the fourth row),
dd
Out[37]:
            ID    SEQ                         DTM  STATUS
ID SEQ
C1 572 0    C1  572.0              5/9/2017 10:13      PE
       1    C1  572.0              5/9/2017 12:24      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   579 2    C1  579.0  2017-07-03 18:56:33.183000      PE
       3    C1  579.0              5/9/2017 13:25      OK
   587 4    C1  587.0              5/9/2017 10:20      PE
       5    C1  587.0              5/9/2017 12:25      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   590 6    C1  590.0              5/9/2017 10:21      PE
       7    C1  590.0              5/9/2017 13:09      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   604 8    C1  604.0              5/9/2017 10:38      PE
       9    C1  604.0              5/9/2017 12:32      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   609 10   C1  609.0              5/9/2017 10:39      PE
       11   C1  609.0              5/9/2017 13:29      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   613 12   C1  613.0              5/9/2017 10:39      PE
       13   C1  613.0              5/9/2017 13:08      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   618 14   C1  618.0              5/9/2017 10:40      PE
       15   C1  618.0              5/9/2017 13:33      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   636 16   C1  636.0              5/9/2017 10:54      PE
       17   C1  636.0              5/9/2017 13:36      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   642 18   C1  642.0              5/9/2017 10:55      PE
       19   C1  642.0              5/9/2017 13:35      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   643 20   C1  643.0              5/9/2017 10:56      PE
       21   C1  643.0              5/9/2017 13:34      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
   656 22   C1  656.0              5/9/2017 10:55      PE
       23   C1  656.0              5/9/2017 13:36      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C2 86  24   C2   86.0              9/5/2016 19:45      PE
       25   C2   86.0              9/6/2016 11:55      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C3 10  26   C3   10.0             4/17/2017 12:23      PE
       27   C3   10.0             4/17/2017 14:51      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
C4 38  28   C4   38.0             3/25/2017 10:35      PE
       29   C4   38.0             3/25/2017 10:51      OK
       2   NaN    NaN  2017-07-03 18:56:33.183000     NaN
Some cleanup,
ddd = dd.reset_index(level = 2).drop(['level_2', 'ID', 'SEQ'], axis = 1).loc[:, ['DTM', 'STATUS']]
ddd
Out[39]:
                              DTM  STATUS
ID SEQ
C1 572             5/9/2017 10:13      PE
   572             5/9/2017 12:24      OK
   572 2017-07-03 18:56:33.183000     NaN
   579 2017-07-03 18:56:33.183000      PE
   579             5/9/2017 13:25      OK
   587             5/9/2017 10:20      PE
   587             5/9/2017 12:25      OK
   587 2017-07-03 18:56:33.183000     NaN
   590             5/9/2017 10:21      PE
   590             5/9/2017 13:09      OK
   590 2017-07-03 18:56:33.183000     NaN
   604             5/9/2017 10:38      PE
   604             5/9/2017 12:32      OK
   604 2017-07-03 18:56:33.183000     NaN
   609             5/9/2017 10:39      PE
   609             5/9/2017 13:29      OK
   609 2017-07-03 18:56:33.183000     NaN
   613             5/9/2017 10:39      PE
   613             5/9/2017 13:08      OK
   613 2017-07-03 18:56:33.183000     NaN
   618             5/9/2017 10:40      PE
   618             5/9/2017 13:33      OK
   618 2017-07-03 18:56:33.183000     NaN
   636             5/9/2017 10:54      PE
   636             5/9/2017 13:36      OK
   636 2017-07-03 18:56:33.183000     NaN
   642             5/9/2017 10:55      PE
   642             5/9/2017 13:35      OK
   642 2017-07-03 18:56:33.183000     NaN
   643             5/9/2017 10:56      PE
   643             5/9/2017 13:34      OK
   643 2017-07-03 18:56:33.183000     NaN
   656             5/9/2017 10:55      PE
   656             5/9/2017 13:36      OK
   656 2017-07-03 18:56:33.183000     NaN
C2 86              9/5/2016 19:45      PE
   86              9/6/2016 11:55      OK
   86  2017-07-03 18:56:33.183000     NaN
C3 10             4/17/2017 12:23      PE
   10             4/17/2017 14:51      OK
   10  2017-07-03 18:56:33.183000     NaN
C4 38             3/25/2017 10:35      PE
   38             3/25/2017 10:51      OK
   38  2017-07-03 18:56:33.183000     NaN
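Problem
A new row containing the current timestamp has been appended to every ID-SEQ group except the C1-579 group! (See the fourth row in dd and ddd.) In that group, the DTM value of the existing first row (5/9/2017 10:19) was replaced by the timestamp instead.
Questions
- What is causing this?
- What is the additional index level introduced in dd?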
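One check that may help narrow this down, offered as a guess and not as a confirmed diagnosis: since the function uses the group's row count as an index label, it seems worth asking which groups already contain that label after reset_index(). The sketch below rebuilds only the ID and SEQ columns for the first six rows of df (the variable names keys and collisions are mine) and lists the groups where the row count is already present among the row labels:

```python
import pandas as pd

# ID and SEQ for the first six rows of df, with the default 0..5 row labels
# that reset_index() would produce.
keys = pd.DataFrame({
    "ID": ["C1"] * 6,
    "SEQ": [572, 572, 579, 579, 587, 587],
})

# A group "collides" when its row count is already one of its own row labels,
# so g.loc[g.shape[0], 'DTM'] = now would hit an existing row.
collisions = [
    key for key, g in keys.groupby(["ID", "SEQ"])
    if len(g) in g.index
]
print(collisions)
```

On these rows the only group flagged is C1-579, whose rows carry the labels 2 and 3 while its length is 2. If this reading is right, appending by position (for example with pd.concat) instead of by label would sidestep the collision, but I have not verified that against the full data.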
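【Comments】:
- You are more likely to get an answer if you make the question self-contained, with all the required information, rather than linking to other sources and saying "I am trying to do this". Also, reduce your code and example to the simplest case in which the problem still occurs, so that it is easier for people to follow.
- @mjp Thanks for the suggestion. I have restructured the question. This is one I would really like to get an answer to.
Tags: python python-2.7 pandas dataframe pandas-groupby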