How to calculate word coverage in the Gutenberg corpus with the Python library nltk?

【Question title】: How to calculate word coverage in the Gutenberg corpus with the Python library nltk?
【Posted】: 2020-03-24 17:20:50
【Question description】:

Compute the word coverage for every file ID in the Gutenberg text corpus. What code would do this?

import nltk
from nltk.corpus import gutenberg
from decimal import Decimal

for fileid in gutenberg.fileids():
    n_chars = len(gutenberg.raw(fileid))
    n_words = len(gutenberg.words(fileid))
    # average number of characters per word for this file
    print(round(Decimal(n_chars / n_words), 7), fileid)

【Question comments】:

Tags: python-3.x nltk nltk-book

【Solution 1】:

import nltk

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    words = gutenberg.words(fileid)
    # distinct tokens (case-sensitive, punctuation included)
    total_unique_words = len(set(words))
    total_words = len(words)
    # average number of occurrences per distinct token
    print(total_words / total_unique_words, fileid)
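
The loop above treats "word coverage" as the total number of tokens divided by the number of distinct tokens, i.e. how often each distinct token is used on average. That reading is an assumption about what the question means. The same ratio can be sketched as a small standalone function that works on any list of tokens; the name `word_coverage` and the `ignore_case` and `alpha_only` options are hypothetical helpers for illustration and are not part of nltk.

```python
from collections import Counter


def word_coverage(tokens, ignore_case=False, alpha_only=False):
    """Return total tokens / distinct tokens (hypothetical helper, not an nltk API)."""
    if ignore_case:
        # Treat "The" and "the" as the same word
        tokens = (t.lower() for t in tokens)
    if alpha_only:
        # Drop punctuation and numbers, keep purely alphabetic tokens
        tokens = (t for t in tokens if t.isalpha())
    counts = Counter(tokens)
    if not counts:
        # Avoid dividing by zero on an empty text
        return 0.0
    return sum(counts.values()) / len(counts)


sample = ["The", "cat", "saw", "the", "other", "cat", "."]
print(word_coverage(sample))                                    # 7 tokens / 6 distinct
print(word_coverage(sample, ignore_case=True, alpha_only=True))  # 6 tokens / 4 distinct = 1.5
```

With the corpus installed (for example via `nltk.download('gutenberg')`), the function could be called as `word_coverage(gutenberg.words(fileid))` inside the loop. Whether to fold case or drop punctuation depends on how the exercise defines a "word".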

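【Comments】:

Please don't post only code as an answer; also explain what your code does and how it solves the problem in the question. Answers with an explanation are usually of higher quality and are more likely to attract upvotes.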