Python: Detecting duplicated strings in a dataframe column (answers)

【Question title】: Python: Detecting duplicated strings in a dataframe column
【Posted】: 2022-01-01 08:30:04
【Question description】:

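Under the UNIVERSITY column I have multiple university strings extracted from a person's resume, but the result contains many dupes. This is the value under UNIVERSITY: ['university of washington bachelor of science', 'university of washington', 'information school undergraduate researcher']. I want to split the string by comma to get each individual string, and keep a string only if it is not contained in another string. So in this case the result should be only 'university of washington bachelor of science' and 'information school undergraduate researcher'.

Below is the code I tried on this one row; it does not produce the correct result. I first split the long string by comma and iterate over the split elements. Inside the loop I check whether my list is empty. If it is, I append the first element. If it is not, I check whether an existing string in the list contains the new element: if so, I do nothing; otherwise I check whether the new element contains the existing one, and either replace the existing one or add the new element.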

for i, row in df.iterrows():
    lst = []
    element = row.UNIVERSITY.split(',')
    for candidate in element:
        if not lst:
            lst.append(candidate)
            print('This is the first item' + candidate)
            print('current lst:' + str(lst))
        else:
            for existing in lst:
                print('This is what we have in the list now' + existing)
                if existing in candidate:
                    lst.remove(existing)
                    lst.append(candidate)
                    print('dupe detected')
                elif candidate in existing:
                    continue
                else:
                    st.append(candidate)
                    print('new item' + candidate)
    print(lst)

This is the output I get:

This is the first item['university of washington bachelor of science'
current lst:["['university of washington bachelor of science'"]
This is what we have in the list now['university of washington bachelor of science'
new item 'university of washington'
This is what we have in the list now['university of washington bachelor of science'
new item 'information school undergraduate researcher']
["['university of washington bachelor of science'"]

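I don't know why 'university of washington' ends up in the else branch, which is why I see "new item 'university of washington'", and I also don't know why 'information school undergraduate researcher' is missing from the final list. What is wrong here?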

【Question comments】:

Tags: string loops split duplicates

【Solution 1】:

I see one problem: if your list is ["hello", "there"] and you add "hello there", the loop removes "hello" and adds "hello there" but ignores "there".
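
A minimal sketch that reproduces this, assuming the `st.append` in the question was meant to be `lst.append`. The helper name `add_in_one_pass` is hypothetical and only mirrors the question's inner loop:

```python
def add_in_one_pass(lst, candidate):
    """Mimic the question's inner loop, reading `st` as `lst` (an assumption)."""
    for existing in lst:
        if existing in candidate:
            # Mutating lst here shifts the remaining items under the iterator.
            lst.remove(existing)
            lst.append(candidate)
        elif candidate in existing:
            continue
        else:
            lst.append(candidate)
    return lst


result = add_in_one_pass(["hello", "there"], "hello there")
print(result)
```

After "hello" is removed, "there" slides into the position the loop has already visited, so it is never compared with the candidate and stays in the list even though "hello there" contains it.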

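This happens because your for loop modifies the list while it is still iterating over it, before every element has been checked. To fix it, I suggest looping twice: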

good = True
# First pass: check whether candidate is a substring of any existing element.
for element in lst:
    if candidate in element:
        # candidate is redundant; skip it and move on to the next candidate
        good = False
        break
if good:
    # We now know candidate will be added, so drop every element that is a
    # substring of it. Build a new list rather than removing while iterating.
    lst = [element for element in lst if element not in candidate]
    # Add candidate last so the filter above does not remove it
    # (candidate in candidate is always True).
    lst.append(candidate)
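
The two-pass idea above can be wrapped in a self-contained function. This is only a sketch under some assumptions: the printed output in the question (the first element still starts with `['`) suggests each cell holds the string form of a list, so the example parses it with `ast.literal_eval` instead of `split(',')`. The function name `drop_contained` and the sample variable `raw` are hypothetical.

```python
import ast


def drop_contained(strings):
    """Keep only the strings that are not substrings of another kept string."""
    kept = []
    for candidate in strings:
        candidate = candidate.strip()
        # Pass 1: skip candidate if an already kept string contains it.
        if any(candidate in existing for existing in kept):
            continue
        # Pass 2: drop kept strings that candidate contains, then add candidate.
        kept = [existing for existing in kept if existing not in candidate]
        kept.append(candidate)
    return kept


# Assumed cell content: a list that was stored as text.
raw = ("['university of washington bachelor of science', "
       "'university of washington', "
       "'information school undergraduate researcher']")
values = ast.literal_eval(raw)  # parse the text back into a real list
cleaned = drop_contained(values)
print(cleaned)
```

If the column really does hold such strings, something like `df["UNIVERSITY"].apply(lambda s: drop_contained(ast.literal_eval(s)))` might apply this to every row. If the cells are already lists, the `ast.literal_eval` step should be dropped.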

【Comments】:

Thank you so much! I'll try this!