Python regex, conditional searching

【Question title】: Python regex, conditional searching
【Posted】: 2015-02-23 09:29:18
【Question】:

I am trying to split this text

"Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot " \
"for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this " \
"isn't true... Well, with a probability of .9 it isn't."

into the following list of sentences:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Code:

import re

# findall returns only the captured group; the trailing [^a-z] is still consumed
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]', text))

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', "Adam Jones Jr. thinks he didn't. "]

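That is good, but it misses some sentences. Since the final [^a-z] is not part of my group, is there a way to tell Python to resume searching from that character instead of skipping past it?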
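The skipped sentences can be explained by looking at the match spans. The following is a minimal sketch in Python 3 syntax (the variable names are illustrative, not from the question): re.findall resumes scanning where the previous full match ended, and the trailing [^a-z] belongs to the full match even though it lies outside the group, so the capital letter that should start the next sentence has already been consumed.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

pattern = re.compile(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]')

consumed = []
for m in pattern.finditer(text):
    # m.end(1) is where the captured sentence ends, m.end(0) is where the
    # whole match ends; the character in between was eaten by [^a-z].
    consumed.append(text[m.end(1):m.end(0)])
    print(repr(m.group(1)), '-> consumed', repr(consumed[-1]))

print(consumed)  # ['D', 'I']
```

The consumed 'D' and 'I' are the first letters of "Did he mind?" and "In any case, ...", which is why those two sentences never get a chance to match.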

Edit:

This is solved with the look-ahead regex that @sputnick suggested.

# (?=[^a-z]) checks the next character without consuming it
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', 'Did he mind? ', "Adam Jones Jr. thinks he didn't. "
, "In any case, this isn't true... "]

But the last sentence is still missing. Any ideas?

【Comments】:

Related: Python - RegEx for splitting text into sentences (sentence-tokenizing).

Tags: python regex

【Solution 1】:

(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))

Try this and see the demo. Take the captured groups, using re.findall.
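As a rough sketch of how this pattern might be used (Python 3 syntax; the variable names and the strip step are my own additions, not part of the answer): the lookbehind accepts a "." or "?" only if it is not preceded by a capital plus a lowercase letter (as in "Mr" or "Jr"), not preceded by a letter-dot-letter run (as in "i.e"), and is followed by whitespace or the end of the text. Note that "!" is not in the pattern's terminator set.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

sentence_re = re.compile(
    r'(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))'
)

# Each match begins where the previous one ended, i.e. at the space after
# the punctuation, so the captured text is stripped before use.
sentences = [s.strip() for s in sentence_re.findall(text)]

for s in sentences:
    print(s)
```

Because the abbreviation checks are purely shape-based, an abbreviation with a different shape (for example one ending in two lowercase letters) would presumably still be treated as a sentence end.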

https://regex101.com/r/gQ3kS4/45

【Comments】:

【Solution 2】:

Finally:

 # second alternative (.*.$) picks up whatever is left at the end of the text
 print(re.findall(r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$', text))

The above works exactly as required, including the last sentence. But I don't understand why |.*.$ works; please help me understand.
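One way to see why |.*.$ helps is to test the two alternatives separately. This is only a sketch (Python 3 syntax; the names first_alt, second_alt and which are illustrative): at each position the regex engine tries the left alternative first and falls back to the right one only if the left one fails. The final sentence has no space after its period, so the left alternative fails there, and .*.$ (any characters up to the end of the text, since the unescaped "." matches any character) takes over.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

first_alt = r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])'
second_alt = r'.*.$'
last = "Well, with a probability of .9 it isn't."

# The first alternative requires a space after the closing punctuation,
# which the final sentence does not have.
first_on_last = re.search(first_alt, last)
print(first_on_last)  # None

# The second alternative simply matches everything up to the end.
second_on_last = re.search(second_alt, last)
print(second_on_last.group() == last)  # True

# Wrapping each alternative in its own group shows which one produced
# each match on the full text.
combined = re.compile('(' + first_alt + ')|(' + second_alt + ')')
which = [m.lastindex for m in combined.finditer(text)]
print(which)  # [1, 1, 1, 1, 2]
```

A consequence of this design is that the second alternative does not check for sentence punctuation at all, so it would also return a trailing fragment that is not a complete sentence.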

Output:

['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid
 a lot for it. ', 'Did he mind? ', "Adam Jones Jr. thinks he didn't. "
, "In any case, this isn't true... ", "Well, with a probability of .9 
it isn't."]

【Comments】:

For text with no space at the end: re.findall(r'[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)', text)
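A short sketch of the commenter's variant (Python 3 syntax; the rstrip step and variable names are my own additions): here the sentence terminator must be followed either by a space and a non-lowercase character, or by the end of the text, so the last sentence still has to end in ".", "?" or "!" to be returned.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

pattern = r'[A-Z]+[^.].*?[a-z.][.?!](?: (?=[^a-z])|$)'

# The group is non-capturing, so findall returns the whole match,
# including the separating space for every sentence except the last.
sentences = re.findall(pattern, text)

# Drop the trailing space if it is not wanted.
cleaned = [s.rstrip() for s in sentences]
for s in cleaned:
    print(s)
```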

【Solution 3】:

Try this:

# the look-ahead (?=[^a-z]) leaves the next sentence's first letter unconsumed
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))

This uses a positive look-ahead; see http://www.regular-expressions.info/lookaround.html
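To illustrate what "look-ahead" means in isolation, here is a deliberately tiny sketch (Python 3 syntax; the sample string is made up for illustration): a look-ahead asserts what follows without consuming it, so the next match can start on the character that was only inspected.

```python
import re

sample = "aaaa"

# Consuming version: each match uses up two characters, so only
# positions 0 and 2 can start a match.
consuming = re.findall(r'(a)a', sample)
print(consuming)  # ['a', 'a']

# Look-ahead version: the second 'a' is checked but not consumed,
# so positions 0, 1 and 2 all start a match.
lookahead = re.findall(r'a(?=a)', sample)
print(lookahead)  # ['a', 'a', 'a']
```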

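【Comments】:

Wow, regex is great, this is perfect. Thanks @sputnick. What exactly does ?= mean?
It is the syntax for a positive look-ahead; see the link I added to my answer.
Nice tutorial at that link. Is there also a way to include the last sentence, that is, to accept end-of-file in place of the space after the period and the [^a-z]? Something like a word boundary.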