You mentioned pandas, so here is a pandas approach:
import pandas as pd
# create sample data frame
data = [
    (1, 2, 3),
    (1, 4, 5),
    (1, 6, 7),
    (9, 10, 11),
    (9, 12, 13),
]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))
# keep rows whose value in column 'x' appears at most 'ceiling' times
ceiling = 2
low_freq = df['x'].value_counts().loc[lambda x: x <= ceiling].index
# use boolean mask to find rows such that 'x' is in our low_freq list
mask = df['x'].isin(low_freq)
# print results
print(df[mask])
   x   y   z
3  9  10  11
4  9  12  13
# use df[mask].to_csv(...) to write to csv file
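Update:
Here is a way to "take apart" the code above. For example, what is low_freq? Evaluating the expressions below one at a time shows each step of the transformation, so you can modify or extend the approach.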
df['x']
df['x'].value_counts()
df['x'].value_counts().loc[lambda x: x <= ceiling]
df['x'].value_counts().loc[lambda x: x <= ceiling].index
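To make those steps concrete, here is a minimal sketch that rebuilds the sample frame from the first example and prints each intermediate result. The names `counts` and `kept` are introduced here only for illustration; they are not part of the original code.

```python
import pandas as pd

# Same sample data as in the first example.
df = pd.DataFrame(
    [(1, 2, 3), (1, 4, 5), (1, 6, 7), (9, 10, 11), (9, 12, 13)],
    columns=('x', 'y', 'z'),
)
ceiling = 2

# Step 1: how often each distinct value of 'x' occurs
# (index = value of 'x', data = number of occurrences).
counts = df['x'].value_counts()

# Step 2: keep only the values that occur at most 'ceiling' times.
kept = counts.loc[lambda s: s <= ceiling]

# Step 3: the index of that Series holds the 'x' values themselves.
low_freq = kept.index

for label, step in [('counts', counts), ('kept', kept), ('low_freq', low_freq)]:
    print(label)
    print(step)
    print()
```

With this data, the value 1 occurs three times and 9 occurs twice, so only 9 survives the `ceiling = 2` cut.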
Update 2
Apparently the filtering logic did not work as expected. Let's try a different approach:
import pandas as pd
# create sample data frame
data = [(0, 1, 2), (1, 1, 4), (2, 1, 6),
        (3, 9, 10), (4, 9, 12), (5, 7, 21)]
df = (pd.DataFrame(data, columns=('pos_id', 'device_id', 'base_mac'))
        .set_index('pos_id'))
Now use groupby() to count the occurrences of each device_id. The count goes into a new column.
df['dev_id_count'] = (df.groupby('device_id')['device_id']
                        .transform('count'))
print(df)
        device_id  base_mac  dev_id_count
pos_id
0               1         2             3
1               1         4             3
2               1         6             3
3               9        10             2
4               9        12             2
5               7        21             1
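The reason transform('count') can be assigned straight to a new column is that it returns one value per original row, not one value per group. A small sketch of the difference, using only the device_id values from the sample data (the names `per_group` and `per_row` are made up for this illustration):

```python
import pandas as pd

df = pd.DataFrame({'device_id': [1, 1, 1, 9, 9, 7]})
grouped = df.groupby('device_id')['device_id']

# Aggregation: one row per distinct device_id.
per_group = grouped.count()

# Transformation: the group's count is broadcast back to every row,
# so the result has the same length and index as df.
per_row = grouped.transform('count')

print(len(per_group), len(per_row))
print(per_row.tolist())   # [3, 3, 3, 2, 2, 1]
```

Because `per_row` is aligned with the original index, it can be compared against a threshold to get a boolean mask for the whole frame.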
The last step is to filter on this new column:
mask = df['dev_id_count'] <= 2
print(df[mask])
# output not shown, to save space
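If the helper column is not wanted in the result, the same filter can be sketched in two other ways: by building the mask directly from transform('count'), or with groupby().filter(). This is only one possible variation, and `max_count`, `by_mask` and `by_filter` are hypothetical names chosen for this example.

```python
import pandas as pd

data = [(0, 1, 2), (1, 1, 4), (2, 1, 6),
        (3, 9, 10), (4, 9, 12), (5, 7, 21)]
df = (pd.DataFrame(data, columns=('pos_id', 'device_id', 'base_mac'))
        .set_index('pos_id'))

max_count = 2  # hypothetical name for the threshold used above

# Option 1: compute the per-row counts and use them as a mask,
# without storing them in the frame.
counts = df.groupby('device_id')['device_id'].transform('count')
by_mask = df[counts <= max_count]

# Option 2: let groupby().filter() keep or drop whole groups.
by_filter = df.groupby('device_id').filter(lambda g: len(g) <= max_count)

print(by_mask.index.tolist())    # [3, 4, 5]
print(by_filter.index.tolist())  # [3, 4, 5]
```

Both options keep the rows for device_id 9 (two occurrences) and 7 (one occurrence) and drop device_id 1 (three occurrences). The mask version tends to be faster on large frames, since filter() calls a Python function once per group.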