【Question title】: Remove duplicate values in a pandas column, but ignore one value
【Posted】: 2020-10-22 11:01:57
【Question】:

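I'm sure there is an elegant solution, but I can't find it. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one particular value?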

repost_of_post_id                                              title
0        7139471603    Man with an RV needs a place to park for a week   
1        6688293563                                     Land for lease   
2              None                  2B/1.5B, Dishwasher, In Lancaster   
3              None  Looking For Convenience? Check Out Cordova Par...   
4              None  2/bd 2/ba, Three Sparkling Swimming Pools, Sit...   
5              None  1 bedroom w/Closet is bathrooms in Select Unit...   
6              None  Controlled Access/Gated, Availability 24 Hours...   
7              None         Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent   
8        7143099582                        Need Help Getting Approved?   
9              None            *MOVE IN READY APT* REQUEST TOUR TODAY!   

What I want is to keep every None value in repost_of_post_id but drop any repeated numeric value, for example if 7139471603 appears more than once in the dataframe.


[Update] The script below gives me the result I want, but I would like to do it in a one-liner if possible.

# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned

ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")

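# DataFrame.append was removed in pandas 2.0; pd.concat (assumes `import pandas as pd`) gives the same result
ca_housing_unique = pd.concat([ca_housing_repost_none, ca_housing_repost_not_none_unique])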

【Comments】:

  • Which version of pandas are you using?
  • @RafaelBarros pandas==1.0.4
  • Could you try something like this and let me know whether it works? repost_of_post_id = repost_of_post_id[(~repost_of_post_id.duplicated()) | repost_of_post_id.isna()]
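
The expression in the last comment works on a single Series. Applied to the whole frame, the same idea becomes a boolean row mask. Below is a minimal sketch, assuming the placeholder is the literal string "None" as in the asker's script; the sample data is made up for illustration. If the column held real missing values instead, `isna()` would take the place of `eq("None")`.

```python
import pandas as pd

# Hypothetical sample data; "None" is a string placeholder, as in the asker's script.
ca_housing = pd.DataFrame({
    "repost_of_post_id": ["7139471603", "None", "None", "7139471603", "6688293563", "None"],
    "title": ["a", "b", "c", "d", "e", "f"],
})

col = ca_housing["repost_of_post_id"]

# Keep a row if it is the first occurrence of its value OR it holds the placeholder.
ca_housing_unique = ca_housing[~col.duplicated() | col.eq("None")]
print(ca_housing_unique)
```

Unlike the four-step script in the question, this keeps the rows in their original order rather than moving the "None" rows to the top.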

Tags: python python-3.x pandas numpy dataframe


【Solution 1】:

You can drop the None values, detect duplicates among the remaining ids, and then filter those rows out of the original dataframe.

In [1]: import pandas as pd 
   ...: from string import ascii_lowercase 
   ...:  
   ...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5] 
   ...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])}) 
   ...: print(df) 
   ...:  
   ...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])                                 
     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
6   2.0     g
7   3.0     h
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l

     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l
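
The one-liner above packs several steps into a single expression. As a reading aid, here is the same logic split into named steps; the intermediate variable names are introduced only for illustration and are not part of the original answer.

```python
import pandas as pd
from string import ascii_lowercase

ids = [1, 2, 3, None, None, None, 2, 3, None, None, 4, 5]
df = pd.DataFrame({"id": ids, "title": list(ascii_lowercase[:len(ids)])})

# 1. Drop the missing ids so they can never be flagged as duplicates.
non_null = df["id"].dropna()

# 2. Flag repeats among the remaining ids (the first occurrence stays False).
is_repeat = non_null.duplicated()

# 3. Collect the index labels of the flagged rows.
repeat_labels = is_repeat[is_repeat].index

# 4. Keep every row whose label is not in that set.
result = df[~df.index.isin(repeat_labels)]
print(result)
```

Because the filter works on index labels, this approach assumes the dataframe's index is unique.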

【Comments】:

【Solution 2】:

You can use drop_duplicates and then merge the NaN rows back in, like this:

    df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
    

This keeps the first occurrence of each duplicated id along with all of the NaN rows.
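
To see why the merge is needed, it may help to run the two halves separately. The sketch below uses made-up sample data with the answer's column name `post_id`, and the intermediate name `deduped` is only illustrative. It relies on two pandas behaviors: `drop_duplicates` treats NaN values as equal to each other, and `merge` matches NaN keys against each other.

```python
import pandas as pd

# Hypothetical sample data using the answer's column name `post_id`.
df = pd.DataFrame({
    "post_id": [1, 2, None, None, 2],
    "title": ["a", "b", "c", "d", "e"],
})

# drop_duplicates treats NaN values as equal, so only the first NaN row survives...
deduped = df.drop_duplicates("post_id", keep="first")

# ...and the outer merge (on all shared columns) adds the remaining NaN rows back.
df_cleaned = deduped.merge(df[df.post_id.isnull()], how="outer")
print(df_cleaned)
```

Note that `merge` returns a fresh integer index and may reorder rows, so the original row labels are not preserved with this approach.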

【Comments】:
