【Question title】: Read a file one sentence at a time?
【Posted】: 2020-12-24 15:20:36
【Question description】:
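Any idea how to read a file one sentence at a time instead of one line at a time?
The general idea is to read ahead and return a sentence once the end of a sentence (EOS) is detected.
Now comes the tricky part: EOS is usually a period, but not always.
Tools like spacy can detect sentence boundaries, but they expect the whole document to be available.
If the logic were hidden behind a generator/iterator, the code would look like this: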
with SentenceFile.open(....) as sf:
    for sent in sf.next_sentence():
        .....
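A minimal sketch of the read-ahead idea described above, using only the standard library. The end-of-sentence rule here is an assumption made for illustration ('.', '!' or '?' followed by whitespace); it will split wrongly on abbreviations such as "Mr.", which is exactly the tricky part mentioned in the question. The names `EOS` and `read_sentences` are hypothetical.

```python
import io
import re

# Naive end-of-sentence rule (an assumption): '.', '!' or '?' followed by whitespace.
EOS = re.compile(r'(?<=[.!?])\s+')

def read_sentences(lines):
    """Yield sentences from an iterable of lines, holding back the unfinished tail."""
    pending = ''
    for line in lines:
        pending += line
        parts = EOS.split(pending)
        # The last part may be an incomplete sentence, so keep it for the next line.
        pending = parts.pop()
        for part in parts:
            part = ' '.join(part.split())  # collapse line breaks inside a sentence
            if part:
                yield part
    # Flush whatever is left once the input is exhausted.
    tail = ' '.join(pending.split())
    if tail:
        yield tail

sample = io.StringIO("The first sentence spans\ntwo lines. The second is short! Is\nthis the third?\n")
sentences = list(read_sentences(sample))
for sent in sentences:
    print(sent)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, and only the current unfinished sentence is kept in memory.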
【Discussion】:
Tags:
python
nltk
increment
spacy
sentence
【Solution 1】:
This looks like a job for nltk. It can easily tokenize text into sentences, which you can then loop over.
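import nltk

with open("text.txt", "r") as f:
    text = f.read()

# sent_tokenize needs the Punkt tokenizer data, installed beforehand with nltk.download
text_sentenced = nltk.sent_tokenize(text)
for sentence in text_sentenced:
    # do what you want with this sentence
    print(sentence)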
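The snippet above reads the whole file into memory before tokenizing. One way to bound memory, sketched here under the assumption that sentences never cross a blank line, is to tokenize one paragraph at a time. The helpers `iter_paragraphs`, `iter_sentences` and `naive_split` are hypothetical names; `naive_split` is only a stand-in so the sketch runs without NLTK data, and in practice `nltk.sent_tokenize` could be passed as `split` instead.

```python
import io
import re

def iter_paragraphs(lines):
    """Yield blank-line-separated paragraphs from an iterable of lines."""
    chunk = []
    for line in lines:
        if line.strip():
            chunk.append(line.strip())
        elif chunk:
            yield ' '.join(chunk)
            chunk = []
    if chunk:
        yield ' '.join(chunk)

def iter_sentences(lines, split):
    """Tokenize one paragraph at a time with the given splitter callable."""
    for paragraph in iter_paragraphs(lines):
        yield from split(paragraph)

def naive_split(text):
    # Stand-in splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return re.split(r'(?<=[.!?])\s+', text)

doc = io.StringIO("One. Two wraps\nacross lines.\n\nThree!\n")
result = list(iter_sentences(doc, naive_split))
print(result)
```

Only one paragraph is held in memory at a time, at the cost of assuming that paragraph breaks are always sentence breaks.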
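【Solution 2】:
I ended up using chained buffers: a line buffer feeds a sentence buffer, and the iterator then consumes the sentence buffer.
I also use spacy to detect the sentence boundaries. The downside is that I have to run it twice on some of the text...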
import re
import spacy
from collections import deque
from spacy.lang.en.stop_words import STOP_WORDS
class Corpus(object):
    # Load the spaCy model once and share it across instances.
    nlp = spacy.load("en_core_web_sm")

    def __init__(self, fname, kind='sentence', lemmas=True, stop_words=True, filter_punct=True):
        self.fname = fname
        self.file = open(fname, 'r')
        self.kind = kind
        self.filter_punct = filter_punct
        self.stop_words = stop_words
        self.lemmas = lemmas
        self.last_sent = ''  # trailing, possibly incomplete sentence
        self.sents_buf = deque()
        self.line_buf = deque()
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.kind == 'line':
            if len(self.line_buf) == 0 and self.buffer_lines() is False:
                raise StopIteration
            return self.line_buf.popleft()
        if self.kind == 'sentence':
            # Keep refilling: a single pass may not produce a complete sentence.
            while len(self.sents_buf) == 0:
                if self.buffer_sents() is False:
                    raise StopIteration
            return self.sents_buf.popleft()

    def buffer_lines(self):
        # Read up to 10 lines into line_buf; return False once the file is exhausted.
        if self.closed:
            return False
        i = 0
        while i < 10:
            i += 1
            line = self.file.readline()
            if line == '':
                self.file.close()
                self.closed = True
                self.line_buf.append('*')  # add end marker, so last line is processed
                print('> closing file ....')
                return True
            self.line_buf.append(line)
        return True

    def buffer_sents(self):
        # Turn up to 10 buffered lines into sentences.
        i = 0
        while i < 10:
            i += 1
            full = True
            if len(self.line_buf) == 0:
                full = self.buffer_lines()
            if full:
                line = self.line_buf.popleft()
                if not re.match(r'^\s*$', line):
                    txt = self.last_sent + line  # prepend the held-back sentence
                    sents = [s.text for s in Corpus.nlp(txt).sents]
                    # Hold back the last sentence: it may be incomplete, so it is
                    # prepended to the next line and parsed again.
                    self.last_sent = sents.pop()
                    self.sents_buf.extend(sents)
            else:
                if len(self.sents_buf) == 0:
                    return False
        return True
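For comparison, here is a compact sketch of the same hold-back buffering wrapped in a context manager, which is closer to the `with ... as sf` interface from the question and closes the file even if iteration stops early. It is not the code above: `SentenceFile`, `chunk_lines`, `_fill` and `naive_split` are hypothetical names, the remainder is flushed at end of file instead of using a '*' marker, and the regex splitter is only a stand-in for a real boundary detector such as spacy or nltk.

```python
import os
import re
import tempfile
from collections import deque

class SentenceFile:
    """Hypothetical wrapper: iterate over a file sentence by sentence."""

    def __init__(self, fname, split, chunk_lines=10):
        self.fname = fname
        self.split = split              # callable: text -> list of sentences
        self.chunk_lines = chunk_lines
        self.file = None
        self.pending = ''               # held-back, possibly incomplete sentence
        self.sents = deque()

    def __enter__(self):
        self.file = open(self.fname, 'r')
        return self

    def __exit__(self, exc_type, exc, tb):
        self.file.close()
        return False                    # do not swallow exceptions

    def __iter__(self):
        return self

    def __next__(self):
        while not self.sents:
            if not self._fill():
                raise StopIteration
        return self.sents.popleft()

    def _fill(self):
        """Read a chunk of lines; return False once nothing is left to yield."""
        lines = [self.file.readline() for _ in range(self.chunk_lines)]
        at_eof = lines[-1] == ''        # readline() returns '' only at end of file
        text = ' '.join((self.pending + ' ' + ' '.join(lines)).split())
        candidates = self.split(text) if text else []
        # Before EOF the last candidate may be unfinished, so hold it back;
        # it is re-split together with the next chunk.
        self.pending = candidates.pop() if candidates and not at_eof else ''
        self.sents.extend(candidates)
        return bool(self.sents) or not at_eof

def naive_split(text):
    # Stand-in splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return re.split(r'(?<=[.!?])\s+', text)

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Alpha starts\nhere. Beta is\nnext. Gamma ends.\n")

with SentenceFile(tmp.name, naive_split, chunk_lines=2) as sf:
    found = list(sf)
os.remove(tmp.name)
print(found)
```

As in the answer above, the held-back sentence is passed through the splitter a second time with the next chunk, which is the price of not having the whole document available.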