Take nlargest 5 and sum/count the rest in pandas: answers

【Question title】: Take nlargest 5 and sum/count the rest in pandas
【Posted】: 2024-01-17 17:11:02
【Question description】:

My dataset looks like this:

ID   |    country
1    |    USA
2    |    USA
3    |    Zimbabwe
4    |    Germany

I do the following to get the name of the first country and its corresponding value. So in my case it is:

df.groupby(['country']).country.value_counts().nlargest(5).index[0]
df.groupby(['country']).country.value_counts().nlargest(5)[0]
df.groupby(['country']).country.value_counts().nlargest(5).index[1]
df.groupby(['country']).country.value_counts().nlargest(5)[1]
etc.

The output would be:

(USA), 388
(DEU), 245
etc.

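Then I repeat this until I have the top 5 countries in my dataset.

However, how can I get an "Other" or "Others" column that groups all the remaining countries together? Countries like the following are uncommon in my dataset:

Zimbabwe, Iraq, Malaysia, Kenya, Australia, etc.

So I would like a sixth value whose output looks like this:

(Other), 3728

How can I achieve this in pandas?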

【Question discussion】:

Related: Python: Combining Low Frequency Factors/Category Counts
Related: Rename the less frequent categories by “OTHER” python

Tags: python python-3.x pandas count series

【Solution 1】:

Use:

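N = 5
# get the counts for each country, sorted in descending order
s = df.country.value_counts()
# select the top N values (copy so that adding a row does not touch s)
out = s.iloc[:N].copy()
# add the sum of the remaining values as 'Other'
out.loc['Other'] = s.iloc[N:].sum()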

If needed, convert the result to a 2-column DataFrame:

df = out.reset_index()
df.columns = ['country', 'count']
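The steps in this answer can be tried end to end with a small sketch. The sample data and `N = 2` below are made up for illustration and are not from the question; `rename_axis` plus `reset_index(name=...)` is just one way to name the two columns.

```python
import pandas as pd

# Hypothetical sample data (assumption, not the asker's dataset)
df = pd.DataFrame(
    {"country": ["USA", "USA", "USA", "Germany", "Germany",
                 "Zimbabwe", "Iraq", "Kenya"]}
)

N = 2  # keep the top 2 here so the tiny sample still has an "Other" bucket

# Counts per country, sorted in descending order
counts = df["country"].value_counts()

# Top N counts, copied so the enlargement below does not affect `counts`
top = counts.iloc[:N].copy()

# Everything after the top N is summed into a single "Other" entry
top.loc["Other"] = counts.iloc[N:].sum()

# Two-column DataFrame: country label and its count
result = top.rename_axis("country").reset_index(name="count")
print(result)
```

With a real dataset, setting `N = 5` gives the five most frequent countries plus one "Other" row.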

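【Discussion】:

【Solution 2】:

Replace the less frequent countries with 'Other' before using value_counts. An efficient way to do this is with Categorical Data. If you want to keep the original data, work on a copy, e.g. new_country_series = df['country'].copy().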

# convert series to categorical
df['country'] = df['country'].astype('category')

# extract labels
others = df['country'].value_counts().index[5:]
label = 'Other'

# apply new category label
df['country'] = df['country'].cat.add_categories([label])
df['country'] = df['country'].replace(others, label)

Then extract the countries and their counts:

for country, count in df['country'].value_counts():
    print(country, count)
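The same idea (relabel first, then count) can also be sketched without the categorical conversion. This is a different technique from the one in this answer: it uses `Series.isin` and `Series.where` on plain string labels. The sample data and `N = 2` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data (assumption, not the asker's dataset)
df = pd.DataFrame(
    {"country": ["USA", "USA", "USA", "Germany", "Germany",
                 "Zimbabwe", "Iraq", "Kenya"]}
)

N = 2
label = "Other"

# Labels of the N most frequent countries
top_labels = df["country"].value_counts().index[:N]

# Keep a value where it is one of the top labels, otherwise replace it with "Other"
grouped = df["country"].where(df["country"].isin(top_labels), label)

# Count after relabeling; "Other" now aggregates all the infrequent countries
counts = grouped.value_counts()
print(counts)
```

Because the original column is not modified, no copy is needed with this variant.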

【Discussion】:

The for loop raises an error: TypeError: 'int' object is not iterable
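The error occurs because iterating directly over a Series yields only its values (here the integer counts), so each int cannot be unpacked into `country, count`. A minimal sketch of one fix is to iterate with `Series.items()`, which yields (label, value) pairs. The counts below are hypothetical stand-ins for `df['country'].value_counts()`.

```python
import pandas as pd

# Hypothetical counts standing in for df['country'].value_counts()
counts = pd.Series({"USA": 3, "Germany": 2, "Other": 3})

pairs = []
# .items() yields (index label, value) pairs, so unpacking into two names works
for country, count in counts.items():
    print(country, count)
    pairs.append((country, int(count)))
```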