Is there a fast way to get the tokens for each sentence in spaCy?

【Question title】: Is there a fast way to get the tokens for each sentence in spaCy?
【Posted】: 2019-08-28 00:25:12
【Question description】:

To split my sentences into tokens, I am doing the following, which is slow:

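import spacy

nlp = spacy.load("en_core_web_lg")

text = "This is a test. This is another test"

sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    # Slow: this runs the whole pipeline a second time on each sentence
    words = nlp(sent.text)
    tokens = []
    for w in words:
        tokens.append(w)
    # Append once per sentence, after all of its tokens are collected
    sentence_tokens.append(tokens)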

I would like to do this roughly the way nltk handles it: you split the text into sentences with sent_tokenize() and then run word_tokenize() on each sentence.

【Comments】:

Tags: spacy

【Solution 1】:

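The main problem with your approach is that you are processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there is no need to call nlp on the sentence text again: spaCy has already done all of that for you under the hood, and the Doc you get back already contains all the information you need.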
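To make that concrete, here is a minimal sketch showing that each sentence is a Span over tokens that already exist in the Doc. It assumes the spaCy v3 API, and so that it runs without downloading a model it swaps in a blank English pipeline with the rule-based `sentencizer` instead of `en_core_web_lg`. That substitution is an assumption made for illustration, not part of the question.

```python
import spacy
from spacy.tokens import Span

# Assumption: spaCy v3 API. A blank pipeline plus the rule-based sentencizer
# stands in for en_core_web_lg so the sketch runs without a model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a test. This is another test")

for sent in doc.sents:
    # Each sentence is a Span: a slice of the Doc, not a new document.
    assert isinstance(sent, Span)
    # sent.start and sent.end are token offsets into the parent Doc.
    print(sent.start, sent.end, [token.text for token in sent])
```

Because the Span only points back into the Doc, iterating over it costs next to nothing compared with calling nlp a second time.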

So if you need a list of strings for each sentence, one string per token, you can do this:

sentence_tokens = []
for sent in doc.sents:
    sentence_tokens.append([token.text for token in sent])

Or even shorter:

sentence_tokens = [[token.text for token in sent] for sent in doc.sents]

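If you are processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. It processes the texts in batches and yields Doc objects. The spaCy documentation covers it in more detail.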

texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
    sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
    # do something with the tokens
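The loop above rebinds sentence_tokens for every Doc, which is fine if you consume the tokens right away. If you want to keep the results for all texts, a sketch along these lines may help. It assumes the spaCy v3 API and again uses a blank pipeline with the `sentencizer` so that no model is needed; `all_sentence_tokens` and the `batch_size` value are illustrative choices, not something from the answer.

```python
import spacy

# Assumption: spaCy v3 API; blank pipeline + sentencizer so no model is required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = ["Some text. More text.", "Lots and lots of texts"]

# One entry per input text; each entry is a list of sentences,
# and each sentence is a list of token strings.
all_sentence_tokens = []
for doc in nlp.pipe(texts, batch_size=64):  # batch_size is an illustrative value
    all_sentence_tokens.append(
        [[token.text for token in sent] for sent in doc.sents]
    )

print(all_sentence_tokens)
```

nlp.pipe yields the Docs in the same order as the input texts, so the outer list lines up with `texts`.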

【Comments】:

【Solution 2】:

To do only the rule-based tokenization, which is very fast, run:

nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])

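That will not give you sentence boundaries, though. For those you currently still need the parser. If you want both tokens and sentence boundaries: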

nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])

If you have a lot of texts, run nlp.tokenizer.pipe(texts) (the counterpart of make_doc()) or nlp.pipe(texts).
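As a rough illustration of the tokenizer-only route, the sketch below runs nlp.tokenizer.pipe over a couple of texts. It assumes the spaCy v3 API and uses `spacy.blank("en")`, which provides the English tokenizer without any trained components; `tokens_per_text` is a name chosen for this example.

```python
import spacy

# Assumption: spaCy v3; a blank pipeline has a tokenizer but no other components.
nlp = spacy.blank("en")

texts = ["This is a test.", "This is another test"]

# nlp.tokenizer.pipe yields one tokenized-only Doc per input text.
tokens_per_text = [
    [token.text for token in doc] for doc in nlp.tokenizer.pipe(texts)
]
print(tokens_per_text)
```

As noted above, these Docs carry no sentence boundaries, so this only fits when each text is already a single sentence or sentence splitting is not needed.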

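(Once you have run doc = nlp(text), you do not need to run it again on the sentences inside the loop. All the annotations should already be there, so you would only be duplicating them, and that is especially slow.)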

【Comments】: