[Posted]: 2018-01-02 12:01:16
[Problem Description]:
As described in the source code, word_tokenize runs a sentence tokenizer (Punkt) before running the word tokenizer (Treebank):
# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: an option to keep the sentence intact and not sentence-tokenize it
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
What is the benefit of running sentence tokenization before word tokenization?
[Discussion]:
- Good question!!