【Question Title】: Python: Removing all words that start with a capital letter and do not come after punctuation
【Posted】: 2019-02-16 13:30:41
【Question】:

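I want to use a regular expression to remove from a text every word that starts with a capital letter and meets both of the following conditions:

1) The capital letter is followed only by lowercase letters, by "'s" (possessive), or by punctuation (.,?!).

2) The word does not come after ".", "!" or "?".

I tried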

import re

myString='The name of her company is Water Company WC 123 WaTerCompany! She was going to meet Daniel. Why? Because Daniel is her boy friend. Patricia? The daughter of Susana! Look, Daniel\'s car is white'
regex = r"([A-Z][a-z']*)(\s[A-Z][a-z']*)*"
txt = re.sub(regex, " ", myString)

I got

name of her company is    123    !   was going to meet  .  ?   is her boy friend.  ?   daughter of  !  ,   car is white

I want

name of her company is  WC 123 WaTerCompany! She was going to meet . Why? Because is her boy friend. Patricia? The daughter of ! Look, car is white

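【Comments】:

  • Why is Patricia removed in your expected output? It is a capitalized word that comes right after a ".".
  • You are right. Sorry! Edited!
  • One more small issue: the "," after Look should not be removed either.
  • Well, there is a way to support any number of whitespace characters before the word.
  • Check this demo

Tags: regex python-3.x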


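【Solution 1】:

To remove whole words, you need \b word-boundary anchors so that you do not match parts of words. To avoid removing words that follow punctuation, you can use a negative lookbehind, provided there is always a fixed number of whitespace characters between the punctuation and the first letter.

I will assume there is always exactly one space between the punctuation and the next letter. You can always normalize your input first by replacing runs of whitespace with a single space.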
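
The normalization step mentioned above could look roughly like this. It is only a sketch: the helper name normalize_spaces is made up for illustration, and it assumes that collapsing newlines and tabs into plain spaces is acceptable for your text.

```python
import re

def normalize_spaces(text):
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()

sample = "She left.   Why?\n\tBecause  Daniel  called."
print(normalize_spaces(sample))
# She left. Why? Because Daniel called.
```

After this step the single-space assumption holds, so a fixed-width lookbehind such as (?<![!?.]\s) can be applied safely.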

That gives the following regex for the words to remove:

\b(?<![!?.]\s)[A-Z][a-z]*(?:'s)?\b
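
For readability, the same pattern can be written with re.VERBOSE so that each piece carries a comment. This is just an annotated restatement of the regex above; the constant name CAPITALIZED_WORD is an illustrative choice.

```python
import re

CAPITALIZED_WORD = re.compile(
    r"""
    \b              # start at a word boundary, so partial words are skipped
    (?<![!?.]\s)    # not preceded by '!', '?' or '.' plus one whitespace char
    [A-Z][a-z]*     # one uppercase letter, then only lowercase letters
    (?:'s)?         # optional possessive suffix
    \b              # end at a word boundary (rejects 'WC', 'WaTerCompany')
    """,
    re.VERBOSE,
)

print(CAPITALIZED_WORD.findall("She met Daniel. Why? Look, Daniel's car and WC"))
# ['She', 'Daniel', "Daniel's"]
```

Note that "She" is matched because nothing precedes it, so the first word of the text is removed as well, as in the demo output.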

And a demo:

>>> import re
>>> myString='The name of her company is Water Company WC 123 WaTerCompany! She was going to meet Daniel. Why? Because Daniel is her boy friend. Patricia? The daughter of Susana! Look, Daniel\'s car is white'
>>> regex = r"\b(?<![!?.]\s)[A-Z][a-z]*(?:'s)?\b"
>>> re.sub(regex, " ", myString)
'  name of her company is     WC 123 WaTerCompany! She was going to meet  . Why? Because   is her boy friend. Patricia? The daughter of  ! Look,   car is white'

Or try the pattern online at regex101
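
If you would rather not depend on the single-space assumption, one possible alternative (not part of the answer above) is to drop the lookbehind, which must be fixed-width in Python's re module, and use an alternation with a replacement callback instead. The names PATTERN and remove_capitalized below are hypothetical, and the final clean-up of doubled spaces is an extra step added for tidier output.

```python
import re

# Branch 1 (captured): sentence punctuation, whitespace, capitalized word -> keep.
# Branch 2 (not captured): any other capitalized word -> remove.
PATTERN = re.compile(
    r"([!?.]\s+[A-Z][a-z]*(?:'s)?\b)"
    r"|\b[A-Z][a-z]*(?:'s)?\b"
)

def remove_capitalized(text):
    # Protected matches are put back unchanged; everything else is dropped.
    stripped = PATTERN.sub(lambda m: m.group(1) or "", text)
    # Collapse the gaps left behind by the removed words.
    return re.sub(r" {2,}", " ", stripped).strip()

print(remove_capitalized("She met Daniel.   Why? Look, Daniel's car"))
# met . Why? Look, car
```

Because the scan runs left to right, the punctuation is reached before the word that follows it, so the protecting branch gets the first chance to match regardless of how much whitespace sits in between.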

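【Comments】:

  • The lookbehind should be placed after the word boundary for better performance.
  • @WiktorStribiżew: Thanks, that did indeed shave about 180 steps off the matching process.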