【Posted】: 2019-05-17 07:52:31
【Question】:
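I am having some trouble correctly identifying sentences in a text for certain corner cases:
- If an ellipsis (dot, dot, dot) is involved, it is not retained.
- If a " is involved.
- If a sentence accidentally starts with a lowercase letter.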
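This is how I currently identify sentences in a text (source: Subtitles Reformat to end with complete sentence):
The re.findall part basically looks for a chunk of the str that starts with a capital letter [A-Z], followed by anything except sentence-ending punctuation, and ends with a punctuation mark [\.?!].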
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
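Corner case 1: dot, dot, dot
The dot, dot, dot is not retained, because the pattern says nothing about what to do when three dots appear in a row. How can this be changed?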
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
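One possible adjustment for this case, shown only as a minimal sketch: let the closing character class repeat, so a run of terminators is consumed as a whole. The names ELLIPSIS_AWARE and sentences_with_ellipsis are made up for illustration and are not part of the original snippet.

```python
import re

text = ("We were able to respond to the first research question... "
        "Next, we also determined the size of the population.")

# [.!?]+ (note the +) consumes a whole run of terminators,
# so "..." stays attached to the sentence it ends.
ELLIPSIS_AWARE = re.compile(r'[A-Z][^.!?]*[.!?]+')

sentences_with_ellipsis = ELLIPSIS_AWARE.findall(text)
for sentence in sentences_with_ellipsis:
    print(sentence + "\n")
```

This still treats every dot as a sentence end, so abbreviations such as "e.g." would be split incorrectly.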
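Corner case 2: "
The " symbol is retained successfully inside a sentence, but, just like the dots that follow the punctuation mark, it is dropped at the end.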
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?
Next, we also determined the size of the population.
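A sketch of one way to keep the closing quote, assuming a quote that directly follows the terminator belongs to the sentence that just ended. QUOTE_AWARE and sentences_with_quotes are hypothetical names used only here.

```python
import re

text = ('We were able to respond to the first "research" question: '
        '"What is this?" Next, we also determined the size of the population.')

# After the terminator(s), optionally keep one closing quote character.
QUOTE_AWARE = re.compile(r'''[A-Z][^.!?]*[.!?]+["']?''')

sentences_with_quotes = QUOTE_AWARE.findall(text)
for sentence in sentences_with_quotes:
    print(sentence + "\n")
```

Because each match must still begin with [A-Z], a sentence that opens with a quotation mark would lose that opening quote with this pattern.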
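Corner case 3: lowercase start of a sentence
If a sentence accidentally starts with a lowercase letter, that sentence is ignored. The aim would be to recognize that the previous sentence has ended (or that the text has just started), so a new sentence must begin.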
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
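One way to approach this, as a rough sketch rather than a definitive fix: stop requiring a capital letter and let a sentence start at the first non-whitespace character after the previous match. ANY_START and sentences_any_case are illustrative names.

```python
import re

text = ("We were able to respond to the first research question. "
        "next, we also determined the size of the population.")

# \S accepts any non-whitespace first character instead of only [A-Z],
# so a sentence begins wherever the previous one stopped.
ANY_START = re.compile(r'\S[^.!?]*[.!?]+')

sentences_any_case = ANY_START.findall(text)
for sentence in sentences_any_case:
    print(sentence + "\n")
```

Trailing text without a final terminator is still dropped by findall, since every match has to end in ., ! or ?.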
Thank you very much for your help!
Edit:
I tested:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
...but I get:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
【Comments】:
- Good idea. I will probably do that if no other option is available.
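- Not about your corner cases, but a general idea: maybe you could split the text into sentences on the indicator ". ", that is, a dot followed by a space and not by another dot? If that at least is a common factor, all the other concerns (quotes and so on) could be ignored, but I am only guessing. To create a regex that matches a dot that is not preceded by certain other characters, see: regular-expressions.info/lookaround.html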
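The splitting idea from this comment could be sketched with re.split and lookbehind assertions. This is one interpretation of the suggestion, not the commenter's own code: the split_sentences helper and its boundary pattern are assumptions, extended so that a closing quote after the terminator is tolerated.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows ., ! or ?, optionally with one closing quote."""
    # Each lookbehind is fixed-width, as Python's re module requires:
    # one character (terminator) or two (terminator plus closing quote).
    boundary = r'''(?<=[.!?])\s+|(?<=[.!?]["'])\s+'''
    return [part for part in re.split(boundary, text) if part]

print(split_sentences("First question... next, a lowercase start."))
```

Because the text is split at boundaries instead of matched sentence by sentence, ellipses, closing quotes and lowercase starts are all retained. Abbreviations followed by a space (for example "Dr. Smith") would still be split.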
Tags: python regex python-3.x string