Identify Sentences in Text: Answers

【Question title】: Identify Sentences in Text
【Posted】: 2019-05-17 07:52:31
【Question description】:

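For certain corner cases, I am having trouble correctly identifying sentences in a text:

If an ellipsis (dot, dot, dot) is involved, it is not kept.
If a " is involved.
If a sentence accidentally starts with a lowercase letter.

This is how I currently identify sentences in a text (source: Subtitles Reformat to end with complete sentence):

The re.findall part basically looks for a part of the str that starts with a capital letter, [A-Z], followed by anything except punctuation, and ends with a punctuation mark, [\.?!].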

import re

text = "We were able to respond to the first research question. Next, we also determined the size of the population."
# Each match starts at a capital letter and runs up to the first ., ! or ?.
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Next, we also determined the size of the population.

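Corner case 1: dot, dot, dot

The dot, dot, dot is not kept, because nothing specifies what to do when three dots appear in a row. How could this be changed?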

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Next, we also determined the size of the population.

Corner case 2: "

The " symbol is kept successfully inside a sentence, but, like the dots that follow the punctuation mark, it is dropped at the end.

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

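Corner case 3: lowercase start of a sentence

If a sentence accidentally starts with a lowercase letter, the sentence is ignored. The goal would be to recognize that the previous sentence has ended (or that the text has just begun), so a new sentence must start.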

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Thank you very much for your help!

Edit:

I tested:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

...but I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:

nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

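【Comments】:

Good idea. I might do that if no other option is available.
Not about your corner cases, but a general idea: maybe you could split the text into sentences using ". " as the indicator, that is, a dot followed by a space and not by another dot? If that at least is a common factor, all the other concerns (quotes and so on) could be ignored, but I am only guessing. To create a regex that matches a dot not preceded by certain other characters, see: regular-expressions.info/lookaround.html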
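
The splitting idea from the last comment can be sketched with a lookbehind, as a rough illustration rather than a tested solution. The helper name split_on_terminator is hypothetical, and allowing one closing double quote between the terminator and the whitespace is an assumption that goes beyond what the comment proposes.

```python
import re

# Split at whitespace that directly follows ., ! or ?, optionally with one
# closing double quote in between. Lookbehinds in Python's re module must be
# fixed-width, so the two cases are written as two separate lookbehinds.
_BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+')


def split_on_terminator(text):
    """Hypothetical helper: return the non-empty pieces between boundaries."""
    return [part for part in _BOUNDARY.split(text) if part]


text = "We were able to respond to the first research question... next, we also determined the size of the population."
for sentence in split_on_terminator(text):
    print(sentence)
```

Because the first character of the following sentence is never inspected, a lowercase start does not matter here. The trade-off is that abbreviations such as "e.g." followed by a space would be split as well.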

Tags: python regex python-3.x string

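【Solution 1】:

You can modify your regex to match your corner cases.

First, you don't need to escape . inside [].

For the first corner case, you can greedily match the sentence-ending marks with [.!?]*.

For the second, you can match an optional " after [.!?].

For the last one, you can let your sentence start with either an uppercase or a lowercase letter: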

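import re

# [A-Za-z] rather than [A-z]: the range A-z would also match [ \ ] ^ _ and `.
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

# Corner case 1: ellipsis
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 2: closing quotation mark
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 3: lowercase start
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

Explanation

[A-Za-z]: every match should start with a letter, either upper or lower case. (The shorthand [A-z] is best avoided, since it also matches the ASCII characters between Z and a: [, \, ], ^, _ and `.)
[^.?!]*: greedily matches any character that is not ., ? or ! (the sentence-ending characters).
[.?!]*: greedily matches the ending characters, so ...??!!??? will be matched as part of the sentence.
"?: finally matches an optional quotation mark at the end of the sentence.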

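Corner case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Corner case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Corner case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.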
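
The four parts of the pattern explained above can also be written as a commented regex with re.VERBOSE, which may be easier to maintain. This is only a sketch: the names SENTENCE_RE and find_sentences are hypothetical, and the first character class is spelled [A-Za-z] (an assumption made here so that the ASCII punctuation between Z and a is excluded).

```python
import re

SENTENCE_RE = re.compile(r'''
    [A-Za-z]    # first character: any ASCII letter, upper or lower case
    [^.!?]*     # greedy: everything up to the first ., ! or ?
    [.!?]*      # greedy: the whole run of terminators, such as ... or ?!
    "?          # at most one closing double quote
''', re.VERBOSE)


def find_sentences(text):
    """Hypothetical helper: return every match of SENTENCE_RE in text."""
    return SENTENCE_RE.findall(text)


text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in find_sentences(text):
    print(sentence)
```

Since the terminator run may be empty, a trailing fragment with no closing punctuation is returned as a sentence too; whether that is desirable depends on the input.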

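【Comments】:

Great answer! That is what I was looking for. Just a quick question: what does "it greedily matches" mean?
It means it will match the whole ..., whereas the non-greedy form *? would not match the ...
Got it. Thank you very much.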
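
To make the greedy versus non-greedy distinction from this exchange concrete, here is a small sketch on a made-up sample string (the text and variable names are illustrative only, not from the thread).

```python
import re

text = "First question... Next one."

# Greedy quantifier (*): the terminator run takes as many characters as it can.
greedy = re.findall(r'[A-Za-z][^.!?]*[.!?]*', text)

# Lazy quantifier (*?): the terminator run takes as few characters as it can,
# and since nothing follows it in the pattern, that is zero characters.
lazy = re.findall(r'[A-Za-z][^.!?]*[.!?]*?', text)

print(greedy)  # ['First question...', 'Next one.']
print(lazy)    # ['First question', 'Next one']
```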

【Solution 2】:

You can use an industrial-strength package for this. For example, spaCy has a very good sentence tokenizer.

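from __future__ import unicode_literals, print_function
# Import path from spaCy 1.x; later releases moved it to spacy.lang.en (as in the question's edit).
from spacy.en import English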

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

Your scenarios:

Case 1 result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

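【Comments】:

Thanks for your answer. Is this free?
Yes. No problem :)
Would you mind trying the three examples from my question and posting the results in your answer?
Would you mind looking at my updated question? I tested your approach, but I keep getting an error.
I don't see that error. Try going to the spaCy page and downloading everything you need (like the neg dictionary etc.). That should solve your problem.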

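【Solution 3】:

Try this regex: ([A-Z][^.!?]*[.!?]+["]?)

'+' means one or more

'?' means zero or one

This should pass corner cases 1 and 2 that you mentioned above. Corner case 3 is still not covered, because the pattern starts with [A-Z]; replacing that with [A-Za-z] would accept a lowercase start as well.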
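
A quick way to check a candidate regex is to run it over the three texts from the question. The sketch below does that for the regex in this answer, unchanged; the names PATTERN, CASES and results are illustrative only.

```python
import re

# The regex from this answer, unchanged.
PATTERN = re.compile(r'([A-Z][^.!?]*[.!?]+["]?)')

# The three texts from the question, in order (corner cases 1, 2 and 3).
CASES = [
    "We were able to respond to the first research question... Next, we also determined the size of the population.",
    "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population.",
    "We were able to respond to the first research question. next, we also determined the size of the population.",
]

# One list of matched sentences per case.
results = [PATTERN.findall(text) for text in CASES]
for number, sentences in enumerate(results, start=1):
    print(number, len(sentences), sentences)
```

On these inputs, cases 1 and 2 each yield two sentences, while case 3 yields only the first one, because the lowercase "next" cannot satisfy the leading [A-Z].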

【Comments】: