【Question Title】: Extracting 25 words to both sides of a word from a text
【Posted】: 2019-02-09 17:25:04
【Question】:

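I have the following text and I am trying to use the pattern below to extract 25 words on each side of a match. The challenge is that the matches overlap, so the Python regex engine returns only one match. I would appreciate any help with this.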

Text

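2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average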

I tried the following pattern

pattern = r'(?<=outlook\s)((\w+.*?){25})'

This produces a single match, whereas I need two, and it does not matter whether one overlaps the other.

Basically, I need two matches.

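【Comments】:

  • Have you tried re.findall?
  • What are the expected matches here?
  • Do you have to use regex for this?
  • I don't have to use regex, but I would like to. The expected matches are two strings, each containing the 25 words after "outlook", since "outlook" occurs twice.
  • Try re.findall(r'(?=outlook\s+(\w+(?:\W+\w+){25}))', s). If fewer than 25 words may follow, replace {25} with {1,25} or even {0,25}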
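The lookahead idea from the last comment can be sketched as follows. This is a minimal illustration, not a definitive solution: the shortened `text`, the use of `\S+` tokens instead of `\w+`, and the `re.IGNORECASE` flag are assumptions made here. The repetition is `{0,24}` because the first word is matched separately, so each capture holds at most 25 words.

```python
import re

# Shortened copy of the question's text (an assumption, to keep the sketch small).
text = (
    "2015 Outlook The Company is providing the following outlook for 2015 "
    "in lieu of formal financial guidance at this time. This outlook does "
    "not include the impact of any future acquisitions and "
    "transaction-related costs."
)

# A lookahead consumes no characters, so the scan can find the next
# "outlook" inside the words already captured for an earlier hit.
# \S+(?:\s+\S+){0,24} captures 1 to 25 whitespace-separated tokens.
pattern = re.compile(r"(?=\boutlook\s+(\S+(?:\s+\S+){0,24}))", re.IGNORECASE)

# With a single capture group, findall returns the captured strings.
following = pattern.findall(text)
for words in following:
    print(words)
    print()
```

Because the flag makes the search case-insensitive, the capitalized "Outlook" in the heading is counted as well, so this sample yields three overlapping captures rather than two. An occurrence written as "outlook." would be skipped, since the pattern requires whitespace right after the keyword.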

Tags: python regex python-3.x regex-lookarounds


【Solution 1】:

I wouldn't use regex at all - the Python re module does not return overlapping matches...

text = """2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"""

lookfor = "outlook"

# split text at spaces
splitted = text.lower().split()

# get the position in splitted where the words match (remove .,-?! for comparison) 
positions = [i for i,w in enumerate(splitted) if lookfor == w.strip(".,-?!")]


# printing here, you can put those slices in a list for later usage
for p in positions:    # positions is: [1, 8, 21]
    # up to 25 words before the match, the match itself, and up to 25 words after
    print( ' '.join(splitted[max(0,p-25):p+26]) )
    print()

Output:

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs.

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs. revenues - based on the revenues from the fourth quarter of 2014, the

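By iterating over the split words you get the positions of the matches and can then slice the split list around each one. Make sure the slice starts at 0 whenever the computed start index is negative, otherwise you get no output (a negative start such as -4 counts from the end of the list).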
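The same split-and-slice idea can be wrapped in a small function that returns the windows instead of printing them. This is a sketch under assumptions: `contexts`, `width` and the tiny `sample` string are hypothetical names introduced here, and unlike the answer above it keeps the original capitalization of the text.

```python
def contexts(text, keyword, width=25):
    """Return one word window per occurrence of keyword (hypothetical helper)."""
    words = text.split()
    # Compare case-insensitively and ignore punctuation attached to the word.
    hits = [i for i, w in enumerate(words)
            if w.lower().strip(".,-?!") == keyword.lower()]
    windows = []
    for i in hits:
        # Clamp at 0: a negative start would be counted from the end of the list.
        start = max(0, i - width)
        windows.append(" ".join(words[start:i + width + 1]))
    return windows


sample = "a b outlook c d e outlook. f g"
result = contexts(sample, "outlook", width=2)
print(result)  # ['a b outlook c d', 'd e outlook. f g']
```

Slicing past the end of a list is safe in Python, so only the start index needs clamping.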

【Comments】:

  • Thanks Patrick, that is what I wanted to know, because I could not find out whether it can handle overlaps.
【Solution 2】:

A non-regex way:

string = "2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"
words = string.split()
starting25 = " ".join(words[:25])
ending25 = " ".join(words[-25:])
print(starting25)
print("\n")
print(ending25)

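【Comments】:

  • Thanks @user2229219. I considered that too, but it would be longer, and with multiple matches I would have to generate these texts for each one. I would like to use regex for this, but if there is no other way, I will go with split.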
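One possible way to combine both wishes is to let a regex locate each occurrence and let split() collect the words on either side, so that multiple matches need no extra handling. The following is only a sketch: `around`, `width` and the shortened `sample` are hypothetical names introduced here, not part of either answer.

```python
import re


def around(text, keyword, width=25):
    """Yield (before, match, after) for every occurrence (hypothetical helper)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        # Guard width == 0, because words[-0:] would return the whole list.
        before = text[:m.start()].split()[-width:] if width else []
        after = text[m.end():].split()[:width]
        yield " ".join(before), m.group(0), " ".join(after)


sample = "2015 Outlook The Company is providing the following outlook for 2015."
triples = list(around(sample, "outlook", width=3))
for before, hit, after in triples:
    print(f"{before} [{hit}] {after}")
```

Each occurrence is sliced out of the full text independently, so overlapping windows are not a problem. One limitation of this sketch: punctuation glued to the keyword, as in "outlook.", becomes a separate token on the neighbouring side and counts toward the word limit.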