【Question Title】:Creating a new list out of a list but without duplicate items?
【Posted】:2020-12-19 19:37:02
【Question】:

I have a list

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

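Now I want to get rid of the duplicate items. The problem is that the duplicates are not identical strings: they differ in the numeric ID inside the href and only match from a specific point in the string onwards, i[38:].

My idea was to write a for loop: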

new_list = []
for i in carner_list:
       if i[38:] in new_list:
           print("found")
       else:
           new_list = new_list + [i]
           print("not")

But this doesn't work.

Is there a problem with the syntax, or am I on completely the wrong track?

Best regards, 拉塞尔

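【Question Comments】:

  • Do you need the items with unique text? Damselfly, The Isle of Arran?
  • What output or error does your current code produce?
  • When you check whether an item from carner_list is also in new_list, this will always evaluate to False, because new_list is empty

Tags: python string list for-loop if-statement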


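【Solution 1】:

I wrote a small function called listContains that I think solves your problem. Your code doesn't work because you search new_list for the value i[38:], while what you append to new_list is the whole value of i.
So you also need to apply the [38:] slice to every value in the list before comparing.
I think the code below explains what I mean better: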

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
new_list = []

def listContains(myList, toSearch):
  for val in myList:
    if val[38:] == toSearch:
      return True
  return False

for i in carner_list:
  if listContains(new_list, i[38:]):
    print("found")
  else:
    new_list.append(i)
    print("not")
print(new_list)

If you want to try it out, you can test it here
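
As a possible variation on the loop above, the comparison keys can be remembered in a set so that each lookup does not rescan new_list. This is only a sketch: the helper name dedupe_by_suffix and its start parameter are made up for illustration, and it assumes every string has the same prefix length (38 characters) as in the question.

```python
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

def dedupe_by_suffix(items, start=38):
    """Keep the first item for each distinct item[start:] suffix (hypothetical helper)."""
    seen = set()     # suffixes encountered so far
    unique = []      # first item seen for each suffix, in original order
    for item in items:
        key = item[start:]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

new_list = dedupe_by_suffix(carner_list)
print(new_list)
```

The result should contain the same three items as the listContains version; the set only changes how the membership check is done.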

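【Comments】:

  • Thanks a lot, Giovanni. That was exactly the problem. It's solved now.
  • Glad it solved your problem. Please mark my answer as the accepted solution so that future users can find it right away. Thanks!
【Solution 2】: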

The part of the string that determines a duplicate (from index 38 to the end) is not what you actually store in the list, so the in operator won't work.
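
A small illustration of that point, using the first string from the question (the variable names here are only for the demo): on a list, in compares the value against each whole element, whereas on a single string it performs a substring test.

```python
full = '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>'
stored = [full]          # the list holds the complete string
suffix = full[38:]       # the part used to decide what counts as a duplicate

in_list = suffix in stored        # element-wise equality against whole items
in_string = suffix in stored[0]   # substring test on one string

print(suffix)
print(in_list, in_string)
```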

You can instead use a dict to store the de-duplicated strings, keyed by the part of the string you care about, so that the in operator works:

new = {}
for i in carner_list:
    key = i[38:]
    if key not in new:
        new[key] = i
print(list(new.values()))

This outputs:

['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
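
The same idea can probably be written a little more compactly with dict.setdefault, which stores a value only when the key is not present yet. This is a sketch rather than a replacement for the loop above; it relies on dicts preserving insertion order, which is guaranteed from Python 3.7 onwards.

```python
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

new = {}
for i in carner_list:
    # setdefault leaves an existing entry untouched, so the first match wins
    new.setdefault(i[38:], i)

result = list(new.values())
print(result)
```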

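【Comments】:

【Solution 3】:

With the way you are currently searching, you check whether the substring is equal to any element of new_list. That will never be true, because it is only a substring of the elements you stored.

You can use filter with a lambda to keep only the items of new_list that contain the substring, convert the result to a list, and check that its length is not 0.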

    len(list(filter(lambda x: i[38:] in x, new_list))) != 0
    

Final code:

    carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
     '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
     '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
     '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
     '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
    
    
    new_list = []
    
    for i in carner_list:
        if len(list(filter(lambda x: i[38:] in x, new_list))) != 0:
            print("found")
        else:
            new_list.append(i)
            print("not")
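
The filter/len check can also be expressed with any() and a generator expression, which stops at the first match instead of building a whole list. The following is a sketch of that equivalent form, not part of the original answer:

```python
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

new_list = []
for i in carner_list:
    # True as soon as one stored string contains the suffix of i
    if any(i[38:] in x for x in new_list):
        continue
    new_list.append(i)

print(new_list)
```

Note that this is still a substring test; comparing x[38:] == i[38:] would be stricter if one title could appear inside another.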
    

【Comments】:

【Solution 4】:

Use BeautifulSoup to parse the HTML, then check the link text.

For example:

      from bs4 import BeautifulSoup
      
      carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
       '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
       '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
       '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
       '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
      
      new_list = []
      check_val = set()
      for i in carner_list:
          s = BeautifulSoup(i, "html.parser")
          if s.text not in check_val:    #check for text
              new_list.append(i)
              check_val.add(s.text)
      print(new_list)
      

Output:

      ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
       '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of '
       'Arran</a>',
       '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the '
       'Morning</a>']
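
If installing BeautifulSoup is not an option, a similar text-based check can be sketched with the standard library's html.parser module. The class TextExtractor and the function link_text below are hypothetical names introduced for this example; the approach assumes each list item is a small, well-formed HTML fragment.

```python
from html.parser import HTMLParser

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def link_text(fragment):
    """Return the visible text of a fragment such as '<a ...>Title</a>'."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()   # flush any buffered text
    return "".join(parser.parts)

new_list = []
check_val = set()
for i in carner_list:
    text = link_text(i)
    if text not in check_val:   # compare on the link text only
        new_list.append(i)
        check_val.add(text)

print(new_list)
```

Keying on the parsed text avoids the fixed index 38, so it should keep working if the artist name in the href has a different length.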
      

【Comments】:

【Solution 5】:

Why not use a regular expression?

        import re
        carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
         '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
         '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
         '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
         '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
        
        print({re.findall(r'"([^"]*)"', x)[0].split("/")[4]: x for x in carner_list })
        
        #Below is the output generated 
        '''
        {'Damselfly': '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', 'The+Isle+of+Arran': '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', 'Mean+It+in+the+Morning': '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>'}
        '''
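
One thing to be aware of: because later keys overwrite earlier ones in a dict comprehension, the output above keeps the last Damselfly link (37360958), not the first. If the first occurrence is wanted, a loop with setdefault is one way to do it. This is a sketch; the names HREF and first_by_title are illustrative, and unquote_plus is used only to turn 'The+Isle+of+Arran' into a readable key.

```python
import re
from urllib.parse import unquote_plus

carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
 '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
 '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']

HREF = re.compile(r'href="([^"]*)"')   # captures the href attribute value

first_by_title = {}
for x in carner_list:
    match = HREF.search(x)
    if match is None:
        continue   # skip strings without an href attribute
    slug = match.group(1).rsplit("/", 1)[-1]   # last path segment, e.g. 'The+Isle+of+Arran'
    # setdefault keeps the first link seen for each title
    first_by_title.setdefault(unquote_plus(slug), x)

print(first_by_title)
```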
        

【Comments】:
