【Question title】: Regex python: Return words surrounding character
【Posted】: 2026-02-09 08:35:01
【Question】:

I have a string containing millions of words, and I want a regex that returns the five words on either side of any dollar sign. For example:

string = 'I have a sentence with $10.00 within it and this sentence is done. '

I would like the regex to return

surrounding = ['I', 'have', 'a', 'sentence', 'with', 'within', 'it', 'and', 'this', 'sentence']

My ultimate goal is to count all the words that surround a mention of "$", so the list above would become:

final_return = [('I', 1), ('have', 1), ('a', 1), ('sentence', 2), ('with', 1), ('within', 1), ('it', 1), ('and', 1), ('this', 1)]

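The regex I have developed so far (below) returns the string attached to the currency symbol together with the 5 surrounding characters. Is there a way to edit the regex so that it captures the five surrounding words instead? Should I use NLTK's tokenizer to accomplish this, and if so, how?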

   import re
 .....\$\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{1,2})?.....

【Comments】:

  • Can you import the regex module?

Tags: python regex python-3.x tokenize


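【Solution 1】:

Use split to break the string into words, isalpha to drop non-word tokens, then count how often each word occurs in the resulting list. Note that this counts every word in the string, not only the five on each side of a dollar sign.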

string='I have a sentence with $10.00 within it and this sentence is done. '
string1=string.split()
string2=[s for s in string1 if s.isalpha()]
[[x,string2.count(x)] for x in set(string2)] 
#[['and', 1], ['within', 1], ['sentence', 2], ['it', 1], ['a', 1], ['have', 1], ['with', 1], ['this', 1], ['is', 1], ['I', 1]]

【Comments】:

  • Thank you so much! This is really helpful. Is there any way I can return these words in numerical order?
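
One possible way to get the words ordered by count, as asked in the comment above, is collections.Counter and its most_common() method. This is only a minimal sketch that reuses the string from this answer; the names words and ordered are illustrative and not part of the original answer.

```python
from collections import Counter

string = 'I have a sentence with $10.00 within it and this sentence is done. '

# Keep only purely alphabetic tokens, as in the answer above.
words = [w for w in string.split() if w.isalpha()]

# most_common() returns (word, count) pairs, highest count first.
ordered = Counter(words).most_common()
print(ordered[0])  # ('sentence', 2)
```

Words with equal counts keep the order in which they were first seen.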
【Solution 2】:

You can start with the code below; I am still trying to solve it in a simpler way.

import re

string = 'I have a sentence with $10.00 within it and this sentence is done. '

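# \s+ stops a word from being split across groups; (?:\.\d{2})? makes the cents optional, so "$5" also matches.
surrounding = re.search(r'(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+\$\d+(?:\.\d{2})?\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)', string).groups()

print(surrounding)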

【Comments】:

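    【Solution 3】:

    I don't think regex is the right tool for this problem. Instead, to extract the 10 words surrounding each dollar sign, you can loop over the words and keep track of the five most recently seen ones, so that they are available when a match is found.

    For this you can use collections.deque(), a data structure that can hold a bounded number of items, to keep the five previous words. You can then use a collections.Counter() object to return the word counts within that window.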
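
As a quick illustration of why a bounded deque fits here, the sketch below shows that deque(maxlen=5) discards its oldest item once it is full. The variable name window and the sample words are made up for this example.

```python
from collections import deque

# A deque with maxlen=5 silently drops the oldest item once it is full.
window = deque(maxlen=5)
for word in 'extra1 extra2 I have a sentence with'.split():
    window.append(word)

print(list(window))  # ['I', 'have', 'a', 'sentence', 'with']
```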

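    from collections import deque
    from collections import Counter
    from itertools import chain
    
    def my_counter(string):
        # Rolling window holding the five most recent words before a match.
        container = deque(maxlen=5)
        # One shared iterator, so next_five() consumes words from the main loop.
        words = iter(string.split())
        def next_five(words):
            # Yield up to five following words; stop early if the input runs out.
            for _ in range(5):
                try:
                    yield next(words)
                except StopIteration:
                    return
    
        for w in words:
            if w.startswith('$'):
                # Count the five words before and up to five words after the match.
                yield Counter(chain(container, next_five(words)))
            else:
                container.append(w)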
    

    Demo:

    In [8]: s =  ' extra1 extra2 I have a sentence with $10.00 within it and this sentence is done.asdf asdf a b c d e $5 k j n m k gg ee'
    
    In [9]: list(my_counter(s))
    Out[9]: 
    [Counter({'I': 1,
              'a': 1,
              'and': 1,
              'have': 1,
              'it': 1,
              'sentence': 2,
              'this': 1,
              'with': 1,
              'within': 1}),
     Counter({'a': 1,
              'b': 1,
              'c': 1,
              'd': 1,
              'e': 1,
              'j': 1,
              'k': 2,
              'm': 1,
              'n': 1})]
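
If a single tally across all dollar mentions is wanted, which is the goal stated in the question, the per-match counters can be added together. The sketch below uses two small hand-written Counter objects as stand-ins for the output of my_counter(); the data and the names per_match and total are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-ins for the Counters yielded per "$" mention.
per_match = [
    Counter({'sentence': 2, 'I': 1, 'with': 1}),
    Counter({'k': 2, 'a': 1, 'with': 1}),
]

# Counter supports +, so sum() with an empty Counter as the start value merges them.
total = sum(per_match, Counter())
print(total['with'])      # 2
print(total['sentence'])  # 2
```

For a very large number of matches, calling total.update(c) in a loop avoids building an intermediate Counter on every addition.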
    

    【Comments】:

      【Solution 4】:

      You can combine a regular expression with a Counter, like this:

      (?P<before>(?:\w+\W+){5})
      \$\d+(?:\.\d+)?
      (?P<after>(?:\W+\w+){5})
      

      See a demo on regex101.com.


      Python:
      from collections import Counter
      import re
      
      rx = re.compile(r'''
          (?P<before>(?:\w+\W+){5})
          \$\d+(?:\.\d+)?
          (?P<after>(?:\W+\w+){5})
          ''', re.VERBOSE)
      
      sentence = 'I have a sentence with $10.00 within it and this sentence is done. '
      words = [Counter(m.group('before').split() + m.group('after').split())
                          for m in rx.finditer(sentence)]
      print(words)
      


      This yields (note that a Counter is already a dict subclass):
      [Counter({'sentence': 2, 'I': 1, 'have': 1, 'a': 1, 'with': 1, 'within': 1, 'it': 1, 'and': 1, 'this': 1})]
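
One thing to be aware of is that {5} requires exactly five words on each side, so a mention near the start or end of the text is skipped. A possible variant, sketched below, relaxes this to {0,5} and merges all matches into one Counter. The {0,5} change and the helper name count_around are assumptions for illustration, not part of the original answer.

```python
import re
from collections import Counter

# Variant of the pattern above: {0,5} instead of {5}, so a mention with
# fewer than five words on either side still matches (an assumption about
# the desired behaviour, not the original answer's pattern).
rx = re.compile(r'''
    (?P<before>(?:\w+\W+){0,5})
    \$\d+(?:\.\d+)?
    (?P<after>(?:\W+\w+){0,5})
    ''', re.VERBOSE)

def count_around(text):
    """Return one Counter covering the words around every match."""
    total = Counter()
    for m in rx.finditer(text):
        # "before" ends with non-word characters, so joining cannot fuse two words.
        total.update(re.findall(r'\w+', m.group('before') + m.group('after')))
    return total

print(count_around('Only $5 here'))  # Counter({'Only': 1, 'here': 1})
```

As with the original pattern, matches do not overlap, so a second "$" amount that falls inside the five words after a match is consumed as part of that match rather than matched on its own.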
      

      【Comments】:
