Corpora accessed through the nltk.corpus API usually return a document stream, i.e. a list of sentences, where each sentence is a list of tokens.
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> emma[0]
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
>>> emma[1]
[u'VOLUME', u'I']
>>> emma[2]
[u'CHAPTER', u'I']
>>> emma[3]
[u'Emma', u'Woodhouse', u',', u'handsome', u',', u'clever', u',', u'and', u'rich', u',', u'with', u'a', u'comfortable', u'home', u'and', u'happy', u'disposition', u',', u'seemed', u'to', u'unite', u'some', u'of', u'the', u'best', u'blessings', u'of', u'existence', u';', u'and', u'had', u'lived', u'nearly', u'twenty', u'-', u'one', u'years', u'in', u'the', u'world', u'with', u'very', u'little', u'to', u'distress', u'or', u'vex', u'her', u'.']
The nltk.corpus.gutenberg corpus loads a PlaintextCorpusReader, see
https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L114
and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py
So it reads a directory of text files, one of which is 'austen-emma.txt', and it applies default sentence and word tokenizers to the corpus. In the code these are instantiated as tokenizers/punkt/english.pickle and WordPunctTokenizer(), see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L40
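To see why the output above splits tokens the way it does (e.g. 'twenty', '-', 'one'), here is a minimal sketch of WordPunctTokenizer-style splitting. The helper name `word_punct_tokenize` is hypothetical; the regex mirrors the alphanumeric-vs-punctuation split that WordPunctTokenizer performs, but this is an illustration, not NLTK's actual implementation.

```python
import re

def word_punct_tokenize(text):
    # Hypothetical sketch of WordPunctTokenizer-style behavior:
    # match runs of word characters OR runs of non-space punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("twenty-one years."))
# ['twenty', '-', 'one', 'years', '.']
```

This explains why hyphenated words and trailing punctuation appear as separate tokens in the sentences shown above.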
So to get the desired list of sentence strings, use:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> sents_list = [" ".join(sent) for sent in emma]
>>> sents_list[0]
u'[ Emma by Jane Austen 1816 ]'
>>> sents_list[1]
u'VOLUME I'
>>> sents_list[:1]
[u'[ Emma by Jane Austen 1816 ]']
>>> sents_list[:2]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I']
>>> sents_list[:3]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I', u'CHAPTER I']