【Question title】: Remove a list of phrases from a string
【Posted】: 2020-06-17 22:16:22
【Question】:

I have a list of phrases (n-grams) that need to be removed from a given sentence.

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

I want to get:

    new_sentence = 'Oranges are the main ingredient for a wide of'

I tried Remove list of phrases from string, but it doesn't work ('Oranges' becomes 'Os', and 'drinks' is removed instead of the phrase 'food and drinks').

Does anyone know how to solve this? Thanks!
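
The failure described above can be reproduced with plain substring replacement. This is only a sketch: it assumes the linked approach amounts to calling `str.replace` for each phrase in list order, which the question does not confirm. The variable name `naive` is introduced here for illustration.

```python
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

naive = sentence
for phrase in removed:
    # str.replace matches substrings anywhere, so 'range' is also cut out of
    # 'Oranges', and 'drinks' is removed before 'food and drinks' is tried.
    naive = naive.replace(phrase, '')

# Collapse the doubled spaces left behind by the removals.
naive = ' '.join(naive.split())
print(naive)  # Os are the main ingredient for a wide of food and
```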

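【Comments】:

  • If you need to handle plurals, you should probably use a natural language processing library for that.
  • Have you tried iterating over the removed list and checking whether each entry is in the sentence?
  • You can solve the second problem by sorting the removed list so that longer phrases come first.

Tags: python string text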


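【Solution 1】:

Since you only want to match whole words, I think the first step is to turn everything into lists of words, and then iterate from the longest phrase to the shortest to find things to remove: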

>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'

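Note that this simple approach has some flaws: multiple copies of the same n-gram won't be removed, yet you also can't simply continue that loop after modifying words (its length will have changed), so if you want to handle duplicates you'll need to batch the updates.

【Comments】:

  • Thanks, @Samwise, it works for the example I gave. Unfortunately, my real data has duplicates; is there any way to overcome this?
  • As suggested, batch the updates: instead of modifying words inside that loop and breaking, add i to a list and continue, then apply all the modifications in the list at once. Alternatively, iterate in reverse order (i.e., reverse the range), which lets you modify the list without disrupting the iteration.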
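
The reverse-iteration suggestion in the last comment could look roughly like the sketch below. It is one possible reading of that comment, not the answerer's own code; the function name `remove_ngrams` is made up for this example.

```python
def remove_ngrams(sentence, phrases):
    """Remove every whole-word occurrence of each phrase from sentence."""
    words = sentence.split()
    # Longest n-grams first, so 'food and drinks' is handled before 'drinks'.
    ngrams = sorted((p.split() for p in phrases), key=len, reverse=True)
    for ngram in ngrams:
        n = len(ngram)
        # Walk start positions from right to left: deleting a slice only
        # shifts words to the right of i, and those were already visited.
        for i in range(len(words) - n, -1, -1):
            if words[i:i + n] == ngram:
                del words[i:i + n]
    return " ".join(words)


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
print(remove_ngrams('cold drinks and summer drinks go with food and drinks or just drinks', removed))
```

Because the comparison is done on whole words, 'Oranges' is never confused with 'range'. Punctuation attached to a word (for example 'drinks,') would not match, since `split()` keeps it as part of the token.
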
【Solution 2】:

Regex time!

In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: import re
     ...: removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
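
As a variation on the loop above, the phrases can also be combined into a single alternation so the sentence is scanned only once. This is a sketch rather than part of the original answer; the function name `remove_phrases` is hypothetical.

```python
import re


def remove_phrases(sentence, phrases):
    # Longest first, so the alternation tries 'food and drinks' before 'drinks'.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' inside a phrase literal.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    # Removing a phrase leaves doubled spaces behind; split/join collapses them.
    return " ".join(pattern.sub("", sentence).split())


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_phrases(sentence, removed))  # Oranges are the main ingredient for a wide of
```

`re.sub` replaces every non-overlapping match, so repeated phrases are removed as well. Note that `\b` only marks a boundary next to a word character, so phrases that begin or end with punctuation would need a different boundary check.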

【Comments】:

    【Solution 3】:
        import re
    
        removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
        sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
    
        # sort the phrases by length, longest first
        removed = sorted(removed, key=len, reverse=True)
    
        # match whole words only, using word boundaries; re.escape keeps
        # any regex metacharacters in a phrase literal
        for r in removed:
            sentence = re.sub(r"\b{}\b".format(re.escape(r)), " ", sentence)
    
        # replace multiple whitespaces with a single one
        sentence = re.sub(' +',' ',sentence)
    

    I hope this helps: first, you need to sort the strings to remove by length, so that 'food and drinks' is replaced before 'drinks'.

    【Comments】:

      【Solution 4】:

      Here you go:

      removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
      sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
      
      words = sentence.split()
      resultwords  = [word for word in words if word.lower() not in removed]
      result = ' '.join(resultwords)
      print(result)
      
      

      Result:

      Oranges the main ingredient for a wide of food and
      

      【Comments】:

      • You are not removing food and drinks; this has the same problem as the asker's attempt.