【Question Title】: nltk word_tokenize: why do sentence tokenization before word tokenization?
【Posted】: 2018-01-02 12:01:16
【Question Description】:

As described in the source code, word_tokenize runs a sentence tokenizer (Punkt) before running the word tokenizer (Treebank):

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the sentence intact rather than sentence tokenizing it.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

What is the benefit of doing sentence tokenization before word tokenization?

【Question Comments】:

  • Good question!!

Tags: python nlp nltk tokenize


【Solution 1】:

The default tokenizer used in NLTK (nltk.word_tokenize) is the TreebankWordTokenizer, originally derived from Michael Heilman's tokenizer.sed.

We see in tokenizer.sed that it states:

# Assume sentence tokenization has been done first, so split FINAL periods only. 
s=\([^.]\)\([.]\)\([])}>"']*\)[     ]*$=\1 \2\3 =g

This regex splits only the final period, on the assumption that sentence tokenization has already been performed.
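To see why that assumption matters, here is a minimal sketch using only Python's re module (a translation of the sed rule for illustration, not the sed script itself), applying the pattern to a whole two-sentence string versus sentence by sentence:

```python
import re

# Python equivalent of the sed rule: split off the FINAL period only
final_period = re.compile(r'([^.])(\.)([\])}>"\']*)\s*$')

text = 'Buy milk. Call home.'

# Applied to the whole string, only the last period is split off;
# the first sentence's period stays glued to "milk":
print(final_period.sub(r'\1 \2\3 ', text))
# -> 'Buy milk. Call home . '

# Applied per sentence (i.e. after sentence tokenization),
# every sentence-final period becomes its own token:
print([final_period.sub(r'\1 \2\3 ', s) for s in ['Buy milk.', 'Call home.']])
# -> ['Buy milk . ', 'Call home . ']
```

So without sentence segmentation first, the rule silently under-tokenizes every sentence except the last.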

Following the tokenizer.sed script, nltk.tokenize.treebank.TreebankWordTokenizer performs the same regex operation, documenting the behavior in the class docstring:

class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    This tokenizer performs the following steps:
    - split standard contractions, e.g. ``don't`` -> ``do n't`` and ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line
    """

More specifically, the step "separate periods that appear at the end of line" refers to this particular regex:

# Handles the final period.
# NOTE: the second regex is the replacement during re.sub()
re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r'\1 \2\3 ')
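As a quick self-contained demonstration of that pattern (using only Python's re module, with example sentences of my own): it splits off a sentence-final period while leaving abbreviation periods inside the sentence untouched, and closing brackets/quotes after the final period stay attached to it:

```python
import re

# The final-period regex quoted above from TreebankWordTokenizer
FINAL_PERIOD = re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$')

# Internal abbreviation periods ("e.g.", "Mr.") are untouched;
# only the period ending the string is split off:
print(FINAL_PERIOD.sub(r'\1 \2\3 ', 'e.g. Mr. Brown left.'))
# -> 'e.g. Mr. Brown left . '

# A closing bracket after the final period remains attached after the split:
print(FINAL_PERIOD.sub(r'\1 \2\3 ', '(He left.)'))
# -> '(He left .) '
```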

Is it commonly assumed that sentence tokenization is performed before word tokenization?

Maybe, maybe not; it depends on your task and how you evaluate it. If we look at other word tokenizers, we see that they perform the same final-period split, e.g. in the Moses (SMT) tokenizer:

# Assume sentence tokenization has been done first, so split FINAL periods only.
$text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g;

Likewise in the NLTK port of the Moses tokenizer:

# Splits final period at end of string.
FINAL_PERIOD = r"""([^.])([.])([\]\)}>"']*) ?$""", r'\1 \2\3'

And also in toktok.pl and its NLTK port.


For users who don't want their text to be sentence tokenized, the preserve_line option is available, since the code from https://github.com/nltk/nltk/issues/1710 was merged =)
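To make the pipeline and the preserve_line switch concrete, here is a minimal self-contained sketch of word_tokenize's structure. The naive splitters below are hypothetical placeholders standing in for Punkt and the Treebank tokenizer, not NLTK's actual implementations:

```python
import re

def simple_sent_tokenize(text):
    # Naive sentence splitter (placeholder for Punkt):
    # split on whitespace that follows a period.
    return re.split(r'(?<=\.)\s+', text.strip())

def simple_word_tokenize(sent):
    # Placeholder for the Treebank tokenizer: split off the
    # FINAL period only, then split on whitespace.
    sent = re.sub(r'([^\.])(\.)([\]\)}>"\']*)\s*$', r'\1 \2\3 ', sent)
    return sent.split()

def word_tokenize(text, preserve_line=False):
    # Same structure as NLTK's word_tokenize: sentence-tokenize
    # first unless preserve_line=True, then word-tokenize each piece.
    sentences = [text] if preserve_line else simple_sent_tokenize(text)
    return [tok for s in sentences for tok in simple_word_tokenize(s)]

print(word_tokenize('I saw Kim. Kim waved.'))
# sentence-first: both sentence-final periods become separate tokens
# -> ['I', 'saw', 'Kim', '.', 'Kim', 'waved', '.']

print(word_tokenize('I saw Kim. Kim waved.', preserve_line=True))
# whole line kept: only the very last period is split off
# -> ['I', 'saw', 'Kim.', 'Kim', 'waved', '.']
```

With preserve_line=True the mid-text period stays attached to "Kim.", which is exactly the under-tokenization the sentence-first design avoids.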

For more explanation of the whys and whats, see https://github.com/nltk/nltk/issues/1699

【Discussion】:

  • Why was this assumption necessary for developing these word tokenizers? Is there a paper explaining the reasoning/motivation?
  • It looks like treebanks themselves are structures that "annotate syntactic or semantic sentence structure ... often created on top of a corpus that has already been annotated with part-of-speech tags". So training these models doesn't strictly require the assumption, but at least for treebank-based tokenizers, the model itself is essentially built around sentences.
  • Good question again! I keep asking myself the same thing: why aren't paragraphs or documents the default unit in #nlproc? Why sentences? And if we go the other way, why not morphemes? Or sentence pieces, e.g. github.com/google/sentencepiece