Remove duplicate values in a pandas column, but ignore one value

【Question title】: Remove duplicate values in a pandas column, but ignore one value
【Posted】: 2020-10-22 11:01:57
【Question】:

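I'm sure there is an elegant solution, but I can't find it. In a pandas DataFrame, how do I remove all duplicate values in a column while ignoring one particular value?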

repost_of_post_id                                              title
0        7139471603    Man with an RV needs a place to park for a week   
1        6688293563                                     Land for lease   
2              None                  2B/1.5B, Dishwasher, In Lancaster   
3              None  Looking For Convenience? Check Out Cordova Par...   
4              None  2/bd 2/ba, Three Sparkling Swimming Pools, Sit...   
5              None  1 bedroom w/Closet is bathrooms in Select Unit...   
6              None  Controlled Access/Gated, Availability 24 Hours...   
7              None         Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent   
8        7143099582                        Need Help Getting Approved?   
9              None            *MOVE IN READY APT* REQUEST TOUR TODAY!

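What I want is to keep every row where repost_of_post_id is None, but drop rows whose numeric value is repeated, for example if 7139471603 appears more than once in the DataFrame.

[Update] I got the desired result with the script below, but I would like to do it in a one-liner if possible.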

# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned

ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")

ca_housing_unique = ca_housing_repost_none.append(ca_housing_repost_not_none_unique)

【Comments】:

Which version of pandas are you using?
@RafaelBarros pandas==1.0.4
Could you try something like this and let me know whether it works? repost_of_post_id = repost_of_post_id[(~repost_of_post_id.duplicated()) | repost_of_post_id.isna()]
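
The comment above filters a Series. A minimal sketch of the same mask applied to whole DataFrame rows might look like the following. It assumes, as the question's script suggests, that the missing marker is the string "None" rather than a real NaN; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; "None" is a string sentinel here (an assumption
# based on the question's script, which compares against "None").
ca_housing = pd.DataFrame({
    "repost_of_post_id": ["7139471603", "None", "None",
                          "7139471603", "7143099582", "None"],
    "title": ["a", "b", "c", "d", "e", "f"],
})

col = ca_housing["repost_of_post_id"]

# Keep a row if it is the first occurrence of its value OR if it is "None".
ca_housing_unique = ca_housing[~col.duplicated() | col.eq("None")]
print(ca_housing_unique)
```

If the column holds real missing values instead of the string, `col.isna()` would take the place of `col.eq("None")`.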

Tags: python python-3.x pandas numpy dataframe

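【Solution 1】:

You can drop the None values, detect duplicates among the remaining values, and then filter those duplicate rows out of the original DataFrame.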

In [1]: import pandas as pd 
   ...: from string import ascii_lowercase 
   ...:  
   ...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5] 
   ...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])}) 
   ...: print(df) 
   ...:  
   ...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])                                 
     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
6   2.0     g
7   3.0     h
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l

     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l
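
For readability, the one-liner in this answer can be unrolled into named steps. This is only a sketch of the same logic on the same sample data; the intermediate variable names are made up here.

```python
import pandas as pd
from string import ascii_lowercase

ids = [1, 2, 3, None, None, None, 2, 3, None, None, 4, 5]
df = pd.DataFrame({"id": ids, "title": list(ascii_lowercase[:len(ids)])})

# 1. Ignore missing ids so they are never flagged as duplicates.
non_null = df["id"].dropna()

# 2. Flag every repeat of an id after its first occurrence.
is_dup = non_null.duplicated()

# 3. Collect the index labels of the flagged rows.
dup_index = is_dup[is_dup].index

# 4. Keep every row of the original frame except the flagged ones.
result = df[~df.index.isin(dup_index)]
print(result)
```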

【Discussion】:

【Solution 2】:

You can use drop_duplicates and then merge the result with the NaN rows, like this:

df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')

This keeps the first occurrence of each duplicated id, plus all NaN rows.
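
A small runnable sketch of this approach, using made-up sample data and the answer's `post_id` column name (not the column name from the question), might look like this. Because no `on=` is given, the merge joins on all shared columns, and the result gets a fresh index, so the original row labels are not preserved.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "post_id": [1, 2, None, None, 2],
    "title": ["a", "b", "c", "d", "e"],
})

# drop_duplicates treats NaN values as equal, so only the first NaN row
# survives this step.
deduped = df.drop_duplicates(subset="post_id", keep="first")

# The outer merge on all shared columns adds back the NaN rows that were
# dropped above.
df_cleaned = deduped.merge(df[df.post_id.isnull()], how="outer")
print(df_cleaned)
```

If the original row order or index matters, a boolean mask that filters the DataFrame in place may be the simpler choice.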

【Discussion】: