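【Posted】: 2020-03-27 17:17:08
【Question】:
I have a DataFrame with duplicate shop IDs; some shop IDs occur twice and some occur three times:
I want to keep only unique shop IDs, choosing for each one the row with the shortest shop distance to its assigned area.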
    Area  Shop Name  Shop Distance  Shop ID
0   AAA   Ly         86             5d87790c46a77300
1   AAA   Hi         230            5ce5522012138400
2   BBB   Hi         780            5ce5522012138400
3   CCC   Ly         450            5d87790c46a77300
...
91  MMM   Ju         43             4f76d0c0e4b01af7
92  MMM   Hi         1150           5ce5522012138400
...
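I used pandas drop_duplicates to remove the duplicate rows, but it decides which row to keep by the first/last occurrence of the Shop ID, which doesn't let me choose by distance:
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep='first')
I also tried grouping by Shop ID and then sorting, but the sort returns an error: duplicate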
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
This is what I have been working on so far:
# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep= False)]
# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()
# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index
# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index
# create a mask for the Shops that are under radius 750
dist_1 = df_toclean['Shop Distance']<=750
# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
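* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
I think I've taken the long way around and still haven't figured out how to drop the duplicates. Does anyone know a shorter way to solve this? I'm new to Python, thanks for your help!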
【Comments】:
-
Try sort_values by Shop ID and Distance first (it defaults to ascending=True), then drop_duplicates on the Shop ID and Distance subset.
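A minimal sketch of the sort-then-drop idea from the comment above, using only the rows visible in the question (the elided rows are left out, so this sample frame is an assumption). One caveat: `drop_duplicates` is given only `Shop ID` as its subset here, because including the distance column would only drop rows whose ID and distance both match. The names `nearest` and `nearest_alt` are illustrative, and the `groupby(...).idxmin()` variant is an alternative not mentioned in the thread.

```python
import pandas as pd

# Sample rows copied from the question; the rows elided with "..." are omitted.
shops_df = pd.DataFrame(
    {
        "Area": ["AAA", "AAA", "BBB", "CCC", "MMM", "MMM"],
        "Shop Name": ["Ly", "Hi", "Hi", "Ly", "Ju", "Hi"],
        "Shop Distance": [86, 230, 780, 450, 43, 1150],
        "Shop ID": [
            "5d87790c46a77300",
            "5ce5522012138400",
            "5ce5522012138400",
            "5d87790c46a77300",
            "4f76d0c0e4b01af7",
            "5ce5522012138400",
        ],
    },
    index=[0, 1, 2, 3, 91, 92],
)

# Option 1: sort so the shortest distance comes first within each Shop ID,
# then keep only the first row per Shop ID. sort_index() restores row order.
nearest = (
    shops_df.sort_values(["Shop ID", "Shop Distance"])
    .drop_duplicates(subset="Shop ID", keep="first")
    .sort_index()
)

# Option 2 (alternative): look up the index label of the minimum distance
# in each Shop ID group and select those rows directly.
nearest_alt = shops_df.loc[
    shops_df.groupby("Shop ID")["Shop Distance"].idxmin()
].sort_index()

print(nearest)
```

Both options keep one row per Shop ID. If two rows of the same shop tie on distance, each keeps a single row, but which one is kept may differ between the two approaches.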
Tags: python pandas dataframe drop-duplicates