How to split a sentence at sentence boundaries using iterators in Python

【Question title】: How to split sentence at sentence boundaries using iterators in python
【Posted】: 2018-05-03 00:54:14
【Question description】:

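Hello, I have to parse a string so that it is split at punctuation marks and each sentence is written on a separate line. In some cases a punctuation mark is not a sentence boundary, so I do not split there (for debugging purposes, I print a message when one of these cases occurs).

Here is my code (below):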

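line is the string I am reading
punctuation_list is a predefined list (not that important)
sentence_boundary is a boolean I am trying to use to know when to split the sentence
I use i, prev and c to check the current, next, and next-but-one characters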

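Since I am working backwards, the code finds all the conditions that are NOT sentence boundaries. It checks several cases and uses an iterator to look at the following characters. Because I am using an iterator, I decided to use recursion and pass a smaller string each time, so that I can search the whole string step by step. This part is working.

However, the goal is to split the string at the points where the punctuation IS actually a sentence boundary (i.e. when none of the other cases applies). Because of my recursive function, I have gotten myself into a problem: I cannot keep track of the index I am at in the list, and therefore do not know where to split the sentence. I was thinking of somehow using a helper function, but I do not know how to keep track of the index.

Any help modifying this code would be appreciated. I know my approach is a bit backwards (instead of looking for where to split, I am looking for where not to split the sentence), but if possible I would still like to use this code.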

def parse(line): #function

    sentence_boundary = True

    if (len(line) == 3):
        return

    t = iter(line)
    i = next(t)
    prev = next(t)
    c = next(t)

    # periods followed by a digit with no intervening whitespace are not sentence boundaries
    if i == "." and (prev.isdigit()):
        print("This is a digit")
        sentence_boundary = False

    # periods followed by certain kinds of punctuation are probably not sentence boundaries
    for j in punctuation_list:
        if i == "." and (prev == j):
            print("Found a punctuation")
            sentence_boundary = False


    # periods followed by a whitespace followed by a lower case letter are not sentence boundaries
    if (i == "." and prev == " " and c.islower()):
        print("This is a lower letter")
        sentence_boundary = False

    # periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries
    if i == '.' and prev.islower() and c.islower():
        print("This is a period within a sentence")
        sentence_boundary = False

    # periods followed by a whitespace and then an uppercase letter, but preceded by any of a short list of titles are not sentence boundaries
    if c == '.' and prev.islower() and i.isupper():
        print("This is a title")
        sentence_boundary = False

    index = line.index(i)

    parse(line[index+1:])


if __name__ == "__main__":
    parse(line)

【Question comments】:

Tags: python string split iterator

【Solution 1】:

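I find your code hard to follow. prev is usually short for "previous", so using it to mean "next" makes no sense to me.

The usual way to keep extra state (such as an index) between recursive calls is to pass it as an extra argument. You can use a default value of 0 to start the first call: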

def parse(line, index=0): #function
    ...
    parse(line, index+1)
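
To make the idea of carrying the index through the recursion concrete, here is a minimal sketch. It is not the asker's original function: the helper is_boundary, the extra parameters start and sentences, and the contents of PUNCTUATION and TITLES are all assumptions introduced for illustration (the real punctuation_list was never shown). The rules follow the comments in the question, looking at the characters around line[index] instead of slicing the string.

```python
# Hypothetical stand-ins for the asker's predefined lists (not from the question).
PUNCTUATION = {",", ";", ":", ")", '"', "'"}
TITLES = ("Mr", "Mrs", "Ms", "Dr")


def is_boundary(line, index):
    """Return True if the period at line[index] looks like a sentence boundary."""
    prev = line[index - 1] if index > 0 else ""
    nxt = line[index + 1] if index + 1 < len(line) else ""
    nxt2 = line[index + 2] if index + 2 < len(line) else ""

    if nxt.isdigit():                      # period followed by a digit: "3.50"
        return False
    if nxt in PUNCTUATION:                 # period followed by punctuation
        return False
    if nxt == " " and nxt2.islower():      # period, space, lower-case letter
        return False
    if prev.isalpha() and nxt.isalpha():   # period inside a run of letters: "e.g"
        return False
    if line.endswith(TITLES, 0, index):    # period right after a title: "Dr."
        return False
    return True


def parse(line, index=0, start=0, sentences=None):
    """Recursively walk line, carrying the current index and the start of
    the current sentence as extra arguments instead of slicing the string."""
    if sentences is None:
        sentences = []
    if index >= len(line):
        tail = line[start:].strip()        # keep any text after the last boundary
        if tail:
            sentences.append(tail)
        return sentences
    if line[index] == "." and is_boundary(line, index):
        sentences.append(line[start:index + 1].strip())
        start = index + 1                  # the next sentence begins after the period
    return parse(line, index + 1, start, sentences)


if __name__ == "__main__":
    text = "Dr. Smith paid 3.50 dollars. He left at noon. It was e.g. raining"
    for sentence in parse(text):
        print(sentence)
```

Because index always refers to a position in the original string, the split points are known directly. Note that this makes one recursive call per character, so a long input can exceed Python's default recursion limit; a plain for loop over range(len(line)) with the same is_boundary check would avoid that.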

【Comments】: