【Question Title】:Python extract sentence containing word
【Posted】:2013-04-08 14:37:31
【Question】:

I'm trying to extract from a text all sentences that contain a given word.

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help?

【Comments】:

    Tags: python regex text-segmentation


    【Solution 1】:

    You can use str.split:

    >>> txt="I like to eat apple. Me too. Let's go buy some apples."
    >>> txt.split('. ')
    ['I like to eat apple', 'Me too', "Let's go buy some apples."]
    
    >>> [ t for t in txt.split('. ') if 'apple' in t]
    ['I like to eat apple', "Let's go buy some apples."]
    

    【Comments】:

      【Solution 2】:
      In [7]: import re
      
      In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."
      
      In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
      Out[9]: ['I like to eat apple', " Let's go buy some apples"]
      

      But note that @jamylak's split-based solution is faster:

      In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
      1000000 loops, best of 3: 1.96 us per loop
      
      In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
      1000000 loops, best of 3: 819 ns per loop
      

      For larger strings the speed difference is smaller, but still significant:

      In [24]: txt = txt*10000
      
      In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
      100 loops, best of 3: 8.49 ms per loop
      
      In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
      100 loops, best of 3: 6.35 ms per loop
      

      【Comments】:

        【Solution 3】:

        No regular expression needed:

        >>> txt = "I like to eat apple. Me too. Let's go buy some apples."
        >>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
        ['I like to eat apple.', " Let's go buy some apples."]
        

        【Comments】:

        • @user2187202 You can accept my answer if you like, or the regex solution if that's really what you need, since you did tag this as a regex question; I'm not sure whether regex is actually necessary here, though.
        【Solution 4】:
        In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)                                                                                                                             
        Out[3]: ['I like to eat apple.', " Let's go buy some apples."]
        

        【Comments】:

        • How can I get only apple by adding a boundary, i.e. ['I like to eat apple.']?
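A possible answer to the comment above (a sketch; the `\b` word-boundary anchor is not part of the original answer):

```python
import re

txt = "I like to eat apple. Me too. Let's go buy some apples."

# \b only matches at a word boundary, so "apples" no longer qualifies.
matches = re.findall(r"[^.]*?\bapple\b[^.]*\.", txt)
print(matches)
```

This keeps the structure of the answer's regex and only adds the boundary anchors around the word.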
        【Solution 5】:
        r"\."+".+"+"apple"+".+"+"\."
        

        This line is a bit odd; why concatenate so many separate strings? You could just write r'\..+apple.+\.'.

        Anyway, the problem with your regex is that it is greedy. By default, x+ matches x as many times as possible. So your .+ matches as many characters (any characters) as possible, including periods and apples.

        What you want is a non-greedy expression; you usually get one by appending a ? at the end: .+?

        That gives you this result:

        ['.I like to eat apple. Me too.']
        

        As you can see, you no longer get both apple sentences, but you still get Me too. That's because you still match a . after apple, so it is impossible not to also capture the following sentence.

        A regex that does work is this one: r'\.[^.]*?apple[^.]*?\.'

        Here you no longer match any character, but only characters that are not themselves periods. We also allow matching no characters at all (since there are no non-period characters after apple in the first sentence). Using that expression yields:

        ['.I like to eat apple.', ". Let's go buy some apples."]
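The progression described in this answer can be checked end to end (a sketch reproducing the three regexes on the same input):

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: .+ expands as far as possible, swallowing the whole string.
greedy = re.findall(r"\..+apple.+\.", txt)

# Non-greedy: .+? stops as early as possible, but still crosses one period.
lazy = re.findall(r"\..+?apple.+?\.", txt)

# Excluding periods with [^.] keeps each match inside a single sentence.
fixed = re.findall(r"\.[^.]*?apple[^.]*?\.", txt)
```

Each variant reproduces one of the outputs quoted above.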
        

        【Comments】:

          【Solution 6】:

          Apparently, the sample in the question is extract sentence containing substring, not
          extract sentence containing word. Here is how to solve the extract sentence containing word problem in Python:

          A word can appear at the beginning, middle, or end of a sentence. Not limited to the example in the question, here is a general function for searching for a word in a sentence:

          import re

          def searchWordinSentence(word, sentence):
              # word surrounded by spaces, or at the start/end of the sentence
              pattern = re.compile(' ' + word + ' |^' + word + ' | ' + word + '$')
              return re.search(pattern, sentence) is not None
          

          Limited to the example in the question, we can solve it like this:

          txt="I like to eat apple. Me too. Let's go buy some apples."
          word = "apple"
          print([t for t in txt.split('. ') if searchWordinSentence(word, t)])
          

          The corresponding output is:

          ['I like to eat apple']
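The same word-level check can be written more compactly with a regex word boundary (a sketch; `\b` and `re.escape` are not used in the answer above):

```python
import re

def contains_word(word, sentence):
    # \b anchors at word boundaries; re.escape guards regex metacharacters.
    return re.search(r'\b' + re.escape(word) + r'\b', sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
print([t for t in txt.split('. ') if contains_word('apple', t)])
```

Unlike the hand-built space-based pattern, `\b` also handles punctuation adjacent to the word.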
          

          【Comments】:

            【Solution 7】:
            import nltk
            search = "test"
            text = "This is a test text! Best text ever. Cool"
            contains = [s for s in nltk.sent_tokenize(text) if search in s]
            

            【Comments】:
