How can I capture all sentences in a file with the format of (name): (sentence)\n(name):

【Question title】: How can I capture all sentences in a file with the format of (name): (sentence)\n(name):
【Posted】: 2018-10-26 19:57:54
【Question description】:

I have transcript files in the format

(name): (sentence)\n (

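(name): (sentence)\n
(sentence)\n

and so on. I need all of the sentences. So far I have it working by hard-coding the names that appear in the file, but I need it to be generic.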

utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)

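I'm on Python 3.6, using re. If anyone knows how to do this with spaCy, that would be a great help too. Thanks.

For an empty statement, I just want to capture the \n and put it in its own string. I think I will simply have to grab the tape information given at the end along with the utterance before it, because I can't think of a way to tell whether such a line is part of someone's speech. Also, there is sometimes more than one word between the start of the line and the colon.

Mock data: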

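CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.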

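【Question discussion】:

I suspect you would get better responses if you provided sample data, so that people don't have to spend their own time mocking up data to test a regex.
Your regex seems more complicated than what you describe.

Tags: python regex spacy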

【Solution 1】:

You can use a lookahead that checks for the same name pattern at the start of a line, followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
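
If only the sentences are needed, the pairs returned by the pattern above can be post-processed. The following is a minimal sketch, not part of the original answer: the helper name `extract_utterances` and the shortened sample text are assumptions made for illustration. It strips the trailing newlines and keeps an empty string for a speaker who said nothing.

```python
import re

# Same pattern as in the answer above: a name at the start of a line, a colon,
# then (lazily) everything up to the next "name:" line or the end of the text.
PATTERN = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                     re.MULTILINE | re.DOTALL)

def extract_utterances(text):
    """Return (name, utterance) pairs with surrounding whitespace stripped."""
    return [(name, utterance.strip()) for name, utterance in PATTERN.findall(text)]

# Shortened, hypothetical sample in the same shape as the mock data.
sample = ("CRO: Bye.\n"
          "CALLER:\n"
          "RECORDER: The tape ends.\n"
          "This tape will continue on side B.\n"
          "OPERATOR NEWELL: blah blah.")

pairs = extract_utterances(sample)
sentences = [utterance for _, utterance in pairs]
print(sentences)
```

Keeping the name alongside each utterance makes it easy to filter by speaker later. Note that an untagged line such as the tape notice is still attached to the preceding speaker, as in the output above.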

【Discussion】:

【Solution 2】:

You never gave us mock data, so I tested with the following:

name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.

We can try matching with the following pattern:

^\S+:\s+((?:(?!^\S+:).)+)

This can be explained as:

^\S+:\s+           match the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+)  then match and capture everything up until the next name

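Note that this also handles the edge case of the final sentence: no further name follows it, so the negative lookahead never fails there and all of the remaining content is captured.

Code sample: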

import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)

['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']
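
To make the role of each part easier to see, the same pattern can be written in verbose mode. This is only an illustrative sketch: the constant name `NAME_THEN_TEXT` and the two-speaker sample are assumptions, while the regex itself is the one from the answer. The last utterance shows the edge case described above, where no later name exists and the rest of the text is captured.

```python
import re

NAME_THEN_TEXT = re.compile(r'''
    ^\S+:\s+          # a one-word name at the start of a line, a colon, whitespace
    (                 # capture the utterance
      (?:
        (?!^\S+:)     # stop as soon as a new name begins at a line start
        .             # otherwise consume one character (newlines too, via DOTALL)
      )+
    )
''', re.VERBOSE | re.DOTALL | re.MULTILINE)

# Hypothetical sample: the final utterance spans two lines and nothing follows it.
text = "name1: Here is a sentence.\nname2: Last one\nspans two lines"
matches = NAME_THEN_TEXT.findall(text)
print(matches)
```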


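【Discussion】:

Thanks, this works well. The only case I still need to handle is when there is more than one word before the colon at the start of a line. Sorry for not providing mock data earlier.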
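
One possible way to cover multi-word names, sketched here as an assumption rather than as the answerer's own fix, is to replace `\S+` with `[^:\n]+` so that a name may contain spaces. The constant name `MULTIWORD` and the sample text are hypothetical. Using `\s*` and `*` instead of `\s+` and `+` also lets an empty statement yield an empty string.

```python
import re

# Assumption: a speaker name is any run of characters without a colon or a
# newline, starting at the beginning of a line and ending at the first colon.
MULTIWORD = re.compile(r'^[^:\n]+:\s*((?:(?!^[^:\n]+:).)*)',
                       re.DOTALL | re.MULTILINE)

text = "OPERATOR NEWELL: blah blah.\nGUY IN DESK: I speak words!\nand more words"
utterances = MULTIWORD.findall(text)
print(utterances)
```

The trade-off is that any continuation line containing a colon would be treated as the start of a new speaker. If the names in the transcripts are always upper case, a stricter name pattern could reduce that risk.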