【Question title】: Remove duplicates with pandas while preserving the order [python]
【Posted】: 2021-06-19 02:10:17
【Question description】:

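I have a column in my df from which I need to remove case-sensitive duplicates, keeping the first occurrence. The problem is that some rows may contain words separated by "," or joined by "-". Is there a way to clean this data while preserving the order?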

This is what my data looks like:

3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Starts, Multicor

This is how it should look:

3sprouts Cesto de Roupa Cisne, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor

Thanks in advance.

【Comments on the question】:

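  • Why must 'bright Starts' in the second row be removed? (Case sensitive?) And why does ', Rosa/Roxo' become ',Rosa/Roxo'? (The space)
  • @SCKU 'bright Starts' must be removed because 'Bright-Starts' appears at the beginning of the sentence. As for the space between the comma and Rosa/Roxo, it doesn't matter (I'll also fix it in the description, thanks).
  • Thanks for the reply, but I think that should be called "case-insensitive", shouldn't it? (If 'bright Starts' matches 'Bright-Starts', the first B is compared without regard to case?)
  • @SCKU Yes, actually: if it is the same word, it should be removed whether it is lowercase, uppercase, or proper case.
  • Hi! Did any of the answers below work? If so, and if you wish, you could consider accepting one of them to signal to others that the problem is solved. If not, you can give feedback so they can be improved (or removed altogether).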

Tags: python pandas dataframe duplicates


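【Solution 1】:

Assumption:

  • Words containing '-' are not removed.

Some thoughts:

  • Case-sensitive duplicates: IMO the comparison should be case-insensitive, so compare using .lower().
  • Keep the first occurrence: remove the other matches.
  • Words separated by "," or containing "-" between them: if a '-' is present, split the word on it, then strip the commas, then compare.
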
import re
import itertools

sentences = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'
]

for s in sentences: 
    s_split = s.split(' ') #keep original sentence split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]
    #get compare word split by '-' and ' ', use re or itertools
    #method 1: re
    compare_words = re.split(' |-', s)
    #method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    #method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')

    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]
    
    # start to compare
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) >1: #has_duplicates
            need_removed_index += matched_indexes[1:]
    need_removed_index = list(set(need_removed_index))
    
    # keep remain and join with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))

This prints:

3sprouts Cesto de Roupa Cisne Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor

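This differs slightly from the expected output, but I still cannot figure out why Sprouts is also removed in row 1 (does '3sprouts' match 'sprouts'??)

Never mind... this is only meant to give some ideas.

FYI.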
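
Since the question asks about a DataFrame column, the same word-level idea can be wrapped in a function and applied row by row. The following is only a minimal sketch under assumptions: the function name `drop_repeated_words` and the column name `col` are hypothetical, and a word is dropped as soon as any of its '-'-separated parts has appeared earlier in the row (ignoring case and trailing commas). Like the loop above, it keeps `Sprouts` in row 0.

```python
import pandas as pd


def drop_repeated_words(text: str) -> str:
    """Drop words whose '-'-separated parts already appeared earlier (case-insensitive)."""
    seen = set()   # lowercase word parts encountered so far in this row
    kept = []      # original tokens that survive, in their original order
    for token in text.split(' '):
        # Compare on lowercase parts, ignoring commas at the edges and splitting on '-'.
        parts = [p for p in token.strip(',').lower().split('-') if p]
        is_repeat = any(p in seen for p in parts)
        seen.update(parts)
        if not is_repeat:
            kept.append(token)
    return ' '.join(kept)


# Hypothetical sample frame built from the rows in the question.
df = pd.DataFrame({'col': [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]})
df['col'] = df['col'].apply(drop_repeated_words)
print(df['col'].tolist())
```

Note that this also drops any ordinary word that repeats within a row (for example a second "de"), so it may be too aggressive for some data.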

【Comments】:

    【Solution 2】:
    import pandas as pd

    # sample dataframe used for testing:
    df=pd.DataFrame({'col': {0: '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
      1: 'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
      2: 'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'}})
    

    Try:

    # title-cased copy, used only for the comparison
    out=df['col'].str.title().str.split(', ',expand=True)
    # original-cased copy, which receives the result
    real=df['col'].str.split(', ',expand=True)
    # blank out column 1 of real where column 0 of out contains any value from column 1 of out
    real[1]=real[1].mask(out[0].str.contains(f'({"|".join(out[1])})'))
    # same check for column 2
    real[2]=real[2].mask(out[0].str.contains(f'({"|".join(out[2])})'))
    # finally, join the remaining values with ', '
    df['col']=real.apply(lambda x:', '.join(x.dropna()),1)
    

    Output of df:

        col
    0   3sprouts Cesto de Roupa Cisne Sprouts, Organizador
    1   Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
    2   Bright-Starts Mordedor Twist & Teethe, Multicor
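
The snippet above hard-codes columns 1 and 2 and builds each pattern from the values of every row. A row-wise variant that handles any number of ', '-separated segments might look like the sketch below. It is an illustration under assumptions, not the answer's method: the helper name `drop_repeated_segments` is hypothetical, '-' is treated like a space, and a segment is dropped only when all of its words already appeared in earlier segments of the same row (ignoring case).

```python
import pandas as pd


def drop_repeated_segments(text: str, sep: str = ', ') -> str:
    """Drop a segment when all of its words were already seen in earlier segments."""
    seen_words = set()   # lowercase words from the segments processed so far
    kept = []            # surviving segments, original casing and order
    for segment in text.split(sep):
        # Treat '-' like a space so 'Bright-Starts' yields 'bright' and 'starts'.
        words = set(segment.lower().replace('-', ' ').split())
        is_repeat = bool(words) and words <= seen_words
        seen_words |= words
        if not is_repeat:
            kept.append(segment)
    return sep.join(kept)


# Hypothetical sample frame built from the rows in the question.
sample = pd.DataFrame({'col': [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]})
cleaned = sample['col'].apply(drop_repeated_segments)
print(cleaned.tolist())
```

Because the comparison is done per row, a value that appears in one row cannot cause a segment to be dropped in another row.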
    

    【Comments】:
