【Question title】: Optimize groupby aggregation pandas
【Posted】: 2017-11-03 07:55:35
【Question】:

I have a dataset like this:

   Type Word

0   N   Work
1   N   Rock
2   N   Rock
3   Adj Rock
4   V   Rock
5   N   Work
6   V   Work
7   V   Rock
8   Adj Like
9   N   Rock
10  V   Love
11  V   Like
12  V   Rock
13  Adj Blue
14  Adv Work

I want to count the occurrences of each word and get the top 2 types for each word. The result I expect looks like this:

    Word    Top    Count

0   Rock    N, V    7
1   Work    N, Adv  4
2   Like    Adj, V  2
3   Blue    Adj     1
4   Love    V       1

I wrote a few lines of code that produce the expected result. Here is my code:

In [1]: 
import pandas as pd
df = pd.DataFrame([
    ['N','Work'],
    ['N','Rock'],
    ['N','Rock'],
    ['Adj','Rock'], 
    ['V','Rock'],
    ['N','Work'],
    ['V','Work'],
    ['V','Rock'],
    ['Adj','Like'],
    ['N','Rock'],
    ['V','Love'],
    ['V','Like'],
    ['V','Rock'],
    ['Adj','Blue'],
    ['Adv','Work']], columns=['Type', 'Word'])

In [2]: #Group by column "Word","Type" and count number of each pair
df = df.groupby(["Type", "Word"])["Type"].count().reset_index(name="Count")

In [3]:
df
   Type Word    Count
0   Adj Blue    1
1   Adj Like    1
2   Adj Rock    1
3   Adv Work    1
4   N   Rock    3
5   N   Work    2
6   V   Like    1
7   V   Love    1
8   V   Rock    3
9   V   Work    1

In [4]: #Group by "Word" and sort by "Count" in each group, get top 2
df1 = df.sort_values(["Word","Count"], ascending=False).groupby("Word").head(2)
df1
   Type Word    Count
5   N   Work    2
3   Adv Work    1
4   N   Rock    3
8   V   Rock    3
7   V   Love    1
1   Adj Like    1
6   V   Like    1
0   Adj Blue    1

In [5]: #Groupby "Word" and union "Type" in each group
df1 = df1.groupby('Word')['Type'].apply(', '.join).reset_index(name='Top')
df1
    Word    Top
0   Blue    Adj
1   Like    Adj, V
2   Love    V
3   Rock    N, V
4   Work    N, Adv

In [6]: #Compute number of each word, save to a new dataframe
df_sum = df.groupby('Word')['Count'].sum().reset_index()
df_sum
    Word    Count
0   Blue    1
1   Like    2
2   Love    1
3   Rock    7
4   Work    4

In [7]: #Merge to dataframe containing number of each word
df1 = df1.merge(df_sum).sort_values("Count", ascending=False)
df1
    Word    Top     Count
3   Rock    N, V    7
4   Work    N, Adv  4
1   Like    Adj, V  2
0   Blue    Adj     1
2   Love    V       1

However, this code does not seem optimal. It uses several groupby calls and calls sort_values twice, which would be costly on a large dataset. Can you optimize it? Thanks.

【Discussion】:

    Tags: python pandas group-by


    【Solution 1】:
    df.groupby('Word').agg(dict(
            Type=lambda x: ', '.join(x.value_counts().index[:2]),
            Word='size'
        )).rename(columns=dict(Word='Count')).reset_index().sort_values('Count')
    
       Word    Type  Count
    0  Blue     Adj      1
    2  Love       V      1
    1  Like  V, Adj      2
    4  Work    N, V      4
    3  Rock    N, V      7
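
As a hedged variant of the same idea, the sketch below uses named aggregation on the `Type` column, so the output columns are called `Top` and `Count` directly and no `rename` is needed. It assumes pandas 0.25 or later; the helper name `top_two` and the variable `summary` are made up for illustration, and the sort is descending to match the layout requested in the question. When two types are tied, which one `value_counts()` lists first may depend on the pandas version.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    [['N', 'Work'], ['N', 'Rock'], ['N', 'Rock'], ['Adj', 'Rock'], ['V', 'Rock'],
     ['N', 'Work'], ['V', 'Work'], ['V', 'Rock'], ['Adj', 'Like'], ['N', 'Rock'],
     ['V', 'Love'], ['V', 'Like'], ['V', 'Rock'], ['Adj', 'Blue'], ['Adv', 'Work']],
    columns=['Type', 'Word'],
)

def top_two(types):
    # value_counts() sorts by frequency (descending); keep the first two labels.
    return ', '.join(types.value_counts().index[:2])

summary = (
    df.groupby('Word')['Type']
      .agg(Top=top_two, Count='size')   # named aggregation: one output column per keyword
      .reset_index()
      .sort_values('Count', ascending=False)
)
print(summary)
```

This runs a single groupby and a single sort, at the cost of one Python-level function call per word.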
    

    【Comments】:

      【Solution 2】:

      You can use agg together with Counter to get the most common types, and len to count how many times each word occurs.

      import pandas as pd
      from collections import Counter

      def top_two(types):
          # Counter.most_common(2) yields the two most frequent (type, count) pairs
          return ', '.join(t for t, _ in Counter(types).most_common(2))

      # df is the original two-column DataFrame (Type, Word) from the question
      df_out = df.groupby('Word')['Type'].agg(Top=top_two, count=len).reset_index()
      df_out.sort_values('count', ascending=False) # output
      

      This outputs the following DataFrame:

          Word     Top  count
      3   Rock    N, V      7
      4   Work    N, V      4
      1   Like  Adj, V      2
      0   Blue     Adj      1
      2   Love       V      1
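
To see what `Counter` and `len` each contribute, here is a minimal standalone sketch for a single word, outside pandas. The list `work_types` simply copies the four `Type` values that the question's sample data has for "Work"; the variable names are illustrative only.

```python
from collections import Counter

# The 'Type' values recorded for the word "Work" in the question's sample data.
work_types = ['N', 'N', 'V', 'Adv']

# Count each type, then keep the labels of the two most frequent ones.
counts = Counter(work_types)
top_two = [label for label, _ in counts.most_common(2)]

top = ', '.join(top_two)   # the "Top" cell for this word
total = len(work_types)    # the "count" cell for this word
print(top, total)          # N, V 4
```

`V` and `Adv` both occur once; `most_common` orders elements with equal counts by first occurrence (Python 3.7+), which is why this approach reports `N, V` for "Work" where the question's expected table shows `N, Adv`.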
      

      【Comments】:
