Pandas dataframe: get the first row of each group (answers)

【Question title】: Pandas dataframe get first row of each group
【Posted】: 2013-12-02 18:35:51
【Question description】:

I have a pandas DataFrame like the following.

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group it by ["id","value"] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected result

    id   value
     1   first
     2   first
     3   first
     4  second
     5   first
     6   first
     7  fourth

I tried the following, but it only gives the first row of the DataFrame. Any help with this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

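【Question comments】:

I realize this question is quite old, but I'd suggest accepting @vital_dml's answer, because the way first() behaves with NaNs is very surprising and I don't think most people would expect it.

Tags: python pandas dataframe group-by row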

【Solution 1】:

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

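【Comments】:

Thanks a lot! Works well :) It isn't possible to get the second row in the same way, right? Could you explain that as well?
g = df.groupby(['session']) g.agg(lambda x: x.iloc[0]) This also works, but I have no idea how to get the second value. :(
Suppose you want the first top_n rows of each group, counting from the top: then dx = df.groupby('id').head(top_n).reset_index(drop=True). If you want the last bottom_n rows, counting from the bottom: then dx = df.groupby('id').tail(bottom_n).reset_index(drop=True)
If you want the last n rows, use tail(n) (n defaults to 5). Not to be confused with last(); I made that mistake.
groupby('id', as_index=False) also keeps id as a column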
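
As a rough sketch of the points raised in these comments (using the question's data; the variable names are only illustrative), the following contrasts as_index=False, tail(n) and last():

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# as_index=False keeps 'id' as a regular column, so reset_index() is not needed.
first_rows = df.groupby("id", as_index=False).first()

# tail(n) returns the last n original rows of each group, original index kept.
last_two = df.groupby("id").tail(2)

# last() aggregates to one row per group (last non-null value in each column).
last_agg = df.groupby("id").last()

print(first_rows)
print(last_two)
print(last_agg)
```

tail(2) can return up to two rows per id, while last() always returns exactly one row per id.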

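【Solution 2】:

This gives you the second row of each group (nth is zero-indexed; nth(0) selects the first row, much like first(), although the two treat NaN differently):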

df.groupby('id').nth(1)

Docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
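
A minimal sketch of nth(1) on the question's data. One thing worth noticing is that groups with fewer than two rows (id 5 here) simply drop out of the result. The layout of the returned index differs between pandas versions (from pandas 2.0, nth behaves as a filter and keeps the original index), so the sketch only looks at the value column:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Second row (position 1) of every group that has at least two rows.
second_rows = df.groupby("id").nth(1)

print(second_rows["value"].tolist())
# ['second', 'second', 'third', 'fifth', 'second', 'fifth']
```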

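【Comments】:

If you want several rows, for example the first three, pass a sequence such as nth((0,1,2)) or nth(range(3)).
@RonanPaixão: Somehow, when I pass a range, it throws an error: TypeError: n needs to be an int or a list/set/tuple of ints
@Peaceful: Are you using Python 3? If so, range(3) does not return a list unless you write list(range(3)).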
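
Following the exchange above, here is a small sketch that passes an explicit list to nth, which avoids the range question altogether. It assumes the question's data and compares the result with head(2):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# list(range(2)) is [0, 1]: positions 0 and 1 of each group.
first_two = df.groupby("id").nth(list(range(2)))

# head(2) selects the same rows; compare on the 'value' column only,
# because the index layout of nth() depends on the pandas version.
same = first_two["value"].tolist() == df.groupby("id").head(2)["value"].tolist()
print(same)
```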

【Solution 3】:

Maybe this is what you want

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)

                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31

df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55

【Comments】:

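【Solution 4】:

If you need to get the first row, I'd suggest using .nth(0) rather than .first().

The difference between them is how they handle NaNs: .nth(0) returns the first row of the group no matter what values that row contains, whereas .first() returns the first non-NaN value in each column.

For example, if your dataset is: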

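import numpy as np
import pandas as pd

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.nan,
                        "second","first","second","third",
                        "fourth","first","second"]})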

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

and

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first
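
To check the difference programmatically, a minimal sketch along these lines should work. It only inspects the value column, since the index that nth(0) returns differs between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    "value": ["first", "second", "third", np.nan, "second", "first",
              "second", "third", "fourth", "first", "second"],
})

by_first = df.groupby("id").first()  # skips NaN within each column
by_nth = df.groupby("id").nth(0)     # takes the first row as it is

print(by_first.loc[2, "value"])          # second
print(by_nth["value"].isna().sum())      # 1
```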

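【Comments】:

Good point. .head(1) also seems to behave like .nth(0), except for the index
Another difference is that nth(0) keeps the original index (if as_index=False), whereas first() does not. That was once a big difference for me, because I needed the index itself.
This seems to be the most definitive answer. It is robust to groupby columns with mixed data types.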
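
For the index point made in these comments, a small sketch using head(1), which keeps the original row labels, next to first(), which is indexed by the group key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    "value": ["first", "second", "third", np.nan, "second", "first",
              "second", "third", "fourth", "first", "second"],
})

head_rows = df.groupby("id").head(1)   # original rows, original index
first_rows = df.groupby("id").first()  # one aggregated row per id

print(head_rows.index.tolist())   # [0, 3, 5, 9]
print(first_rows.index.tolist())  # [1, 2, 3, 4]
```

With head(1) the row labels can be used to look the rows up again in df, for example df.loc[head_rows.index].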

【Solution 5】:

If you only need the first row of each group, you can use drop_duplicates; note that the function defaults to keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth
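
As a quick sketch on the question's data, drop_duplicates on id selects the same rows as groupby('id').head(1), and keep='last' gives the last row of each group instead:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

first_rows = df.drop_duplicates(subset="id")              # keep='first' is the default
last_rows = df.drop_duplicates(subset="id", keep="last")  # last row per id

print(first_rows.equals(df.groupby("id").head(1)))  # True
print(last_rows.index.tolist())
```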

【Comments】:

【Solution 6】:

Assuming the 'id' column is a numeric type such as int32/int64, you can also use groupby.rank() as follows

[In]: df[df.groupby('value')['id'].rank() == 1]
[Out]:
   id   value
0   1   first
6   3   third
7   3  fourth
8   3   fifth

If you want to reset the index, just chain .reset_index()

[In]: df[df.groupby('value')['id'].rank() == 1].reset_index()
[Out]:
   index  id   value
0      0   1   first
1      6   3   third
2      7   3  fourth
3      8   3   fifth

If you don't need the index and id columns

[In]: df[df.groupby('value')['id'].rank() == 1].reset_index().drop(['index', 'id'], axis=1)
[Out]:
    value
0   first
1   third
2  fourth
3   fifth
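
Note that "second" is missing from the output above. With the default method='average', the two rows with id 1 and value "second" tie at rank 1.5, so neither equals 1. A hedged sketch of one way around this is to pass method='first', which breaks ties by order of appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Default ranking averages ties, so tied minima get rank 1.5 and are dropped.
default_rank = df[df.groupby("value")["id"].rank() == 1]

# method='first' ranks tied values in order of appearance: one row per value.
tie_broken = df[df.groupby("value")["id"].rank(method="first") == 1]

print(default_rank.index.tolist())  # [0, 6, 7, 8]
print(tie_broken.index.tolist())    # [0, 1, 6, 7, 8]
```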

【Comments】:

【Solution 7】:

I suppose "first" means you have already sorted the DataFrame the way you want.

What I would do is:

df.groupby('id').agg('first')
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

The nice thing is that you can plug in any function you want:

df.groupby('id').agg(['first','last','count'])
     value              
     first    last count
id                      
1    first  second     3
2    first  second     2
3    first   fifth     4
4   second   fifth     2
5    first   first     1
6    first   third     3
7   fourth   fifth     2

The output DataFrame has MultiIndex columns

MultiIndex([('value', 'first'),
            ('value',  'last'),
            ('value', 'count')],
           )
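
If the MultiIndex columns are inconvenient, one option is named aggregation, which produces flat column names. This is a sketch on the question's data; the output column names first_value, last_value and n_rows are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "value": ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Each keyword is an output column name: (input column, aggregation function).
summary = df.groupby("id").agg(
    first_value=("value", "first"),
    last_value=("value", "last"),
    n_rows=("value", "count"),
)

print(summary)
```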

【Comments】: