Map column values to 'miscellaneous' if the value count is below a threshold - Categorical Column - Pandas DataFrame: Answers

【Question title】: Map column values to 'miscellaneous' if the value count is below a threshold - Categorical Column - Pandas DataFrame
【Posted】: 2018-09-05 19:44:23
【Question description】:

I have a pandas DataFrame of shape ~[200K, 40]. One of its categorical columns contains more than 1000 unique values. I can view the count of each unique value in that column with:

df['column_name'].value_counts()

How can I now bin the values:

whose value count is below a threshold, say 100, mapping them to, say, 'miscellaneous'?
OR based on cumulative row count %?

【Question discussion】:

Tags: python pandas

【Solution 1】:

You can take the values to mask from the index of value_counts, then use replace to map them to 'miscellaneous':

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])

frequencies = df['A'].value_counts()

condition = frequencies<200   # you can define it however you want
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')

df['A'] = df['A'].replace(mask_dict)  # or you could make a copy not to modify original data

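Now value_counts shows every value that fell below the threshold grouped under miscellaneous (the exact counts vary from run to run because the sample data is random):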

df['A'].value_counts()
Out[18]: 
miscellaneous    947
3                226
1                221
0                204
7                201
2                201
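The same idea can be wrapped in a small helper so the run is repeatable. This is only a sketch: the names `lump_rare`, `min_count` and `other` are made up for illustration, the fixed seed is an assumption added for reproducibility, and it uses `Series.where` with `isin` instead of `replace`. The column is cast to `object` first so that integers and the string label can share one column.

```python
import numpy as np
import pandas as pd


def lump_rare(series, min_count, other="miscellaneous"):
    """Return a copy where values seen fewer than `min_count` times become `other`."""
    counts = series.value_counts()
    rare = counts.index[counts < min_count]
    # Series.where keeps entries where the condition is True and fills the rest.
    # The cast to object lets numeric values and the string label coexist.
    return series.astype(object).where(~series.isin(rare), other)


rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
df = pd.DataFrame(rng.integers(0, 10, size=(2000, 2)), columns=["A", "B"])

lumped = lump_rare(df["A"], min_count=200)
print(lumped.value_counts())
```

The original column is left untouched; assign the result back with `df['A'] = lumped` if you want to overwrite it.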

【Discussion】:

【Solution 2】:

I think you need:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a','a','a','a','b','b','b','c','d']})

s = df['A'].value_counts()
print(s)
a    4
b    3
d    1
c    1
Name: A, dtype: int64

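If you need to sum all the counts that fall below threshold:

threshold = 2

# True for values whose count is below the threshold
m = s < threshold
# keep only the counts at or above the threshold
out = s[~m]
# add the total of the below-threshold counts as a new 'misc' entry
out['misc'] = s[m].sum()
print(out)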
a       4
b       3
misc    2
Name: A, dtype: int64

But if you only need to rename the index values whose count is below the threshold:

out = s.rename(dict.fromkeys(s.index[s < threshold], 'misc'))
print(out)
a       4
b       3
misc    1
misc    1
Name: A, dtype: int64
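After the rename the index holds 'misc' twice. If a single row per label is wanted, one possible follow-up (a sketch, not part of the original answer) is to group by the index and sum. The Series below is rebuilt by hand from the counts printed above, so the snippet runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts output shown earlier.
s = pd.Series([4, 3, 1, 1], index=["a", "b", "d", "c"])
threshold = 2

renamed = s.rename(dict.fromkeys(s.index[s < threshold], "misc"))
# Summing per index label merges the two 'misc' rows into one.
# sort=False keeps the labels in order of first appearance.
collapsed = renamed.groupby(level=0, sort=False).sum()
print(collapsed.to_dict())  # {'a': 4, 'b': 3, 'misc': 2}
```

This gives the same totals as the summing variant shown before it.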

If you need to replace the values in the original column, use GroupBy.transform with numpy.where:

df['A'] = np.where(df.groupby('A')['A'].transform('size') < threshold, 'misc', df['A'])
print(df)

      A
0     a
1     a
2     a
3     a
4     b
5     b
6     b
7  misc
8  misc
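To see why this works, it may help to look at what `transform('size')` returns: one group size per row, aligned with the original index, so it can be compared against the threshold row by row. The sketch below rebuilds the sample frame and also checks that mapping each value to its `value_counts` entry gives the same numbers, which is an alternative way to get per-row counts.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "a", "b", "b", "b", "c", "d"]})

# One group size per row, aligned with the original index.
sizes = df.groupby("A")["A"].transform("size")
print(sizes.tolist())  # [4, 4, 4, 4, 3, 3, 3, 1, 1]

# Mapping every value to its value_counts entry yields the same per-row counts.
mapped = df["A"].map(df["A"].value_counts())
print(mapped.tolist() == sizes.tolist())  # True
```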

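【Discussion】:

【Solution 3】:

Another solution:

import numpy as np

cond = df['col'].value_counts()
threshold = 100
# keep values whose count reaches the threshold; everything else becomes 'miscellaneous'
df['col'] = np.where(df['col'].isin(cond.index[cond >= threshold]), df['col'], 'miscellaneous')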
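The question also asks about binning by cumulative row percentage, which the snippets above do not cover. A minimal sketch follows, under one assumed reading of that request: keep the most frequent values until they cover a target share of the rows, and lump the rest. The names `lump_by_coverage` and `coverage` are hypothetical, and the rule for the value that crosses the target (it is kept) is a choice made here, not something the question specifies.

```python
import numpy as np
import pandas as pd


def lump_by_coverage(series, coverage=0.9, other="miscellaneous"):
    """Keep the most frequent values that together cover `coverage` of the rows."""
    shares = series.value_counts(normalize=True)  # sorted, most frequent first
    cumulative = shares.cumsum()
    # Keep a value if the rows covered *before* it are still short of the target,
    # so the value that crosses the target is kept as well.
    keep = shares.index[(cumulative - shares) < coverage]
    lumped = np.where(series.isin(keep), series, other)
    return pd.Series(lumped, index=series.index, name=series.name)


col = pd.Series(list("aaaaabbbccd"), name="col")  # counts: a=5, b=3, c=2, d=1
result = lump_by_coverage(col, coverage=0.7)
print(result.value_counts().to_dict())
```

With `coverage=0.7`, 'a' and 'b' together cover 8 of the 11 rows, so 'c' and 'd' are lumped together.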

【Discussion】: