【Question Title】: Identify Sentences in Text
【Posted】: 2019-05-17 07:52:31
【Question】:

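For certain corner cases, I'm having some trouble correctly identifying sentences in a text:

  1. If dot, dot, dot (an ellipsis) is involved, it is not preserved.
  2. If a " is involved.
  3. If a sentence accidentally starts with a lowercase letter.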

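This is how I currently identify sentences in a text (source: Subtitles Reformat to end with complete sentence):

The re.findall part basically looks for a chunk of the str that starts with a capital letter [A-Z], is followed by anything except punctuation, and ends with a punctuation mark [\.?!].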

import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

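Corner case 1: dot, dot, dot

The dot, dot, dot is not preserved, because the pattern says nothing about what to do when three dots appear in a row. How can this be changed?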

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

Corner case 2: "

The " symbol is successfully kept within a sentence, but just like the dots following the punctuation, it gets dropped at the end.

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

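Corner case 3: lowercase start of a sentence

If a sentence accidentally starts with a lowercase letter, the sentence is ignored. The goal is to recognize that the previous sentence has ended (or that the text has just begun), so a new sentence must start.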

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Thank you very much for your help!

Edit:

I tested:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

...but I get this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:

nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

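【Comments】:

  • Good idea. I might do that if no other option is available.
  • Not about your corner cases, but a general idea: maybe you could split the text into sentences using the indicator ". ", that is, a dot followed by a space rather than by other dots? If that at least is a common factor, all the other concerns (quotes and so on) could be ignored, but I'm only guessing. To create a regex that matches a dot not preceded by other specified characters, see: regular-expressions.info/lookaround.html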
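
One way to read the lookaround idea from the comment above is sketched below: split on whitespace that directly follows an end mark, optionally with a closing double quote in between. The name `split_sentences` is hypothetical, and the sketch assumes every sentence ends in `.`, `!` or `?`. It will also split after abbreviations such as "e.g.", so it is a starting point rather than a complete splitter.

```python
import re

# Two fixed-width lookbehinds, because Python's re module rejects a
# variable-width lookbehind: whitespace preceded by an end mark, or by an
# end mark followed by a closing double quote.
BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+')


def split_sentences(text):
    """Split text at whitespace that follows sentence-ending punctuation."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


samples = [
    "We were able to respond to the first research question... Next, we also determined the size of the population.",
    "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population.",
    "We were able to respond to the first research question. next, we also determined the size of the population.",
]

for sample in samples:
    for sentence in split_sentences(sample):
        print(sentence)
    print()
```

Because the split happens on the whitespace between sentences and not on the first letter of the next one, a lowercase sentence start needs no special handling here.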

Tags: python regex python-3.x string


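【Solution 1】:

You can modify your regex to match your corner cases.

First, you don't need to escape . inside []

For the first corner case, you can greedily match the sentence-ending marks with [.!?]*

For the second, you can match " after [.!?]

For the last one, you can start your sentence with either an uppercase or a lowercase letter: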

import re

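# [A-Za-z] rather than [A-z]: the range A-z also covers [ \ ] ^ _ `
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

# Corner case 1: ellipsis
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 2: closing quote after the end mark
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 3: sentence starting with a lowercase letter
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

Explanation

  • [A-Za-z]: every match should start with a letter, either uppercase or lowercase.
  • [^.?!]*: greedily matches any character that is not one of .?! (the sentence-ending characters)
  • [.?!]*: greedily matches the ending characters, so ...??!!??? would be matched as part of the sentence
  • "?: finally matches an optional quote at the end of the sentence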

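Corner case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Corner case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Corner case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.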
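
The four parts of the pattern explained above can also be written out with `re.VERBOSE`, which keeps each comment next to the piece it describes. This is only a sketch: the names `SENTENCE` and `find_sentences` are made up for illustration, and the first character class is written as `[A-Za-z]` so that it matches ASCII letters only.

```python
import re

SENTENCE = re.compile(r'''
    [A-Za-z]   # first character: one ASCII letter, upper- or lowercase
    [^.!?]*    # sentence body: any run of characters that are not end marks
    [.!?]*     # greedy run of end marks, so an ellipsis stays attached
    "?         # optional closing double quote
''', re.VERBOSE)


def find_sentences(text):
    """Return every substring of text that matches the SENTENCE pattern."""
    return SENTENCE.findall(text)


texts = [
    "We were able to respond to the first research question... Next, we also determined the size of the population.",
    "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population.",
    "We were able to respond to the first research question. next, we also determined the size of the population.",
]

for text in texts:
    for sentence in find_sentences(text):
        print(sentence)
    print()
```

Since both the body and the end-mark run use `*`, trailing text without any end mark is still returned as a match. Whether that is desirable depends on the input.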

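【Comments】:

  • Great answer! This is exactly what I was looking for. Just a quick question: what does "it greedily matches" mean?
  • It means it will match ...; the non-greedy form *? would not match ...
  • Got it. Thank you very much.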
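
The difference between greedy and non-greedy matching discussed in these comments can be seen in a small sketch. The sample string is made up for illustration. The only change between the two patterns is the final quantifier, `*` versus `*?`.

```python
import re

text = "Wait... What?"

# Greedy: [.!?]* takes as many end marks as it can.
greedy = re.findall(r'[A-Za-z][^.!?]*[.!?]*', text)

# Non-greedy: [.!?]*? takes as few as it can, which at the end of a
# pattern means none at all.
lazy = re.findall(r'[A-Za-z][^.!?]*[.!?]*?', text)

print(greedy)  # ['Wait...', 'What?']
print(lazy)    # ['Wait', 'What']
```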
【Solution 2】:

You can use an industrial-strength package for this. For example, spacy has a very good sentence tokenizer.

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

Your scenarios:

  1. Case result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']

  2. Case result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']

  3. Case result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

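【Comments】:

  • Thanks for your answer. Is it free?
  • Yes. No problem :)
  • Would you mind trying the three examples from my question and posting the results in your answer?
  • Would you mind taking a look at my updated question? I tested your approach, but I keep getting an error.
  • I don't see that error. Try going to the Spacy page and downloading everything you need (like the neg dictionary, etc.). That should solve your problem.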
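【Solution 3】:

Try this regex: ([A-Z][^.!?]*[.!?]+["]?)

'+' means one or more

'?' means zero or one

This should handle corner cases 1 and 2 mentioned above. Corner case 3 is not covered as written, because [A-Z] only matches an uppercase first letter; [A-Za-z] would be needed there.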
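
A quick check of this pattern against the three sample texts from the question is sketched below. The variable names and the `relaxed` variant, which swaps `[A-Z]` for `[A-Za-z]`, are illustrative additions and not part of the answer. With `[A-Z]` as written, the third text yields only its first sentence.

```python
import re

pattern = re.compile(r'([A-Z][^.!?]*[.!?]+["]?)')
# Hypothetical variant that also accepts a lowercase first letter.
relaxed = re.compile(r'([A-Za-z][^.!?]*[.!?]+["]?)')

cases = {
    "ellipsis": "We were able to respond to the first research question... Next, we also determined the size of the population.",
    "quote": "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population.",
    "lowercase": "We were able to respond to the first research question. next, we also determined the size of the population.",
}

for name, text in cases.items():
    print(name, len(pattern.findall(text)), len(relaxed.findall(text)))
```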

【Comments】:
