【Question title】: Words in WordNet corpus clarification
【Posted】: 2023-04-04 08:43:02
【Question】:

I want to get the length of the WordNet corpus, i.e. how many words it contains.

Code:

from nltk.corpus import wordnet as wn

len_wn = len([word.lower() for word in wn.words()])
print(len_wn)

The output I get is 147306.

My questions:

Am I getting the total length of words in WordNet?
Do tokens such as zoom_in count as words?

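【Comments】:

Try printing out what is in wn.words().
@alvas This question is not a duplicate - I want to check whether I am taking the right approach to get the total length of WordNet, not to find out whether a word is in @ 987654331@, which is what the question you marked as the duplicate covers :)
Printing out wn.words() and looking through the entries would help a lot.

Tags: nlp nltk wordnet nltk-book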

【Solution 1】:

Am I getting the total length of words in WordNet?

It depends on the definition of a "word". The wn.words() function iterates through all lemma names; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191

def words(self, lang="eng"):
    """return lemmas of the given language as list of words"""
    return self.all_lemma_names(lang=lang)


def all_lemma_names(self, pos=None, lang="eng"):
    """Return all lemma names for all synsets for the given
    part of speech tag and language or languages. If pos is
    not specified, all synsets for all parts of speech will
    be used."""

    if lang == "eng":
        if pos is None:
            return iter(self._lemma_pos_offset_map)
        else:
            return (
                lemma
                for lemma in self._lemma_pos_offset_map
                if pos in self._lemma_pos_offset_map[lemma]
            )
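
To make the quoted source easier to follow, here is a minimal sketch of the same iteration logic over a toy index. It assumes the internal map goes from each lemma name to a dict keyed by part-of-speech tag, which is what the `pos in self._lemma_pos_offset_map[lemma]` test above suggests. The name `toy_lemma_pos_offset_map`, its entries, and its offsets are all made up for illustration and are not real WordNet data.

```python
# Toy stand-in for the lemma -> {pos: [offsets]} index (hypothetical data).
toy_lemma_pos_offset_map = {
    "dog": {"n": [1, 2], "v": [3]},
    "zoom_in": {"v": [4]},
    "new_york": {"n": [5, 6, 7]},
}


def all_lemma_names(index, pos=None):
    """Iterate over lemma names in the toy index, optionally filtered by POS."""
    if pos is None:
        # No filter: every key of the index is a lemma name.
        return iter(index)
    # With a filter: keep only lemmas that have an entry for that POS.
    return (lemma for lemma in index if pos in index[lemma])


all_names = list(all_lemma_names(toy_lemma_pos_offset_map))
verb_names = list(all_lemma_names(toy_lemma_pos_offset_map, pos="v"))
print(all_names)   # ['dog', 'zoom_in', 'new_york']
print(verb_names)  # ['dog', 'zoom_in']
```

In this model each distinct lemma name appears once, no matter how many synsets or parts of speech it belongs to, and multi-word entries such as zoom_in sit in the index next to single words.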

So if your definition of a "word" is every possible lemma, then yes, this function gives you the total length of the words in WordNet's lemma names:

>>> sum(len(lemma_name) for lemma_name in wn.words())
1692291
>>> sum(len(lemma_name.lower()) for lemma_name in wn.words())
1692291

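Lowercasing is not necessary, because the lemma names returned by wn.words() should already be lowercase. This holds even for named entities, e.g.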

>>> 'new_york' in wn.words()
True
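
Note that the question and this answer measure two different things: `len([...])` in the question counts entries, while `sum(len(...))` here adds up character lengths. The sketch below computes both in one pass. The helper name `count_and_total_chars` and the sample list are hypothetical, and the sample is hand-made rather than real WordNet output.

```python
def count_and_total_chars(lemma_names):
    """Return (number of entries, summed character length) for an iterable."""
    count = 0
    total_chars = 0
    for name in lemma_names:
        count += 1
        total_chars += len(name)
    return count, total_chars


# Hand-made sample, not real WordNet output.
sample = ["dog", "zoom_in", "new_york"]
count, total_chars = count_and_total_chars(sample)
print(count)        # 3 entries
print(total_chars)  # 3 + 7 + 8 = 18 characters
```

With NLTK and the WordNet data installed, passing `wn.words()` instead of `sample` should give the two figures quoted in this thread (147306 entries and 1692291 characters), although the exact numbers may depend on the WordNet version.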

But note that the same synset can have several very similar lemma names:

>>> 'new_york' in wn.words()
True
>>> 'new_york_city' in wn.words()
True

This is because of how WordNet is structured. The NLTK API organizes "meanings" as synsets; a synset links to multiple lemmas, and each lemma has at least one name:

>>> wn.synset('new_york.n.1')
Synset('new_york.n.01')

>>> wn.synset('new_york.n.1').lemmas()
[Lemma('new_york.n.01.New_York'), Lemma('new_york.n.01.New_York_City'), Lemma('new_york.n.01.Greater_New_York')]

>>> wn.synset('new_york.n.1').lemma_names()
['New_York', 'New_York_City', 'Greater_New_York']

But each "word" you query can belong to multiple synsets (i.e. have multiple meanings), e.g.

>>> wn.synsets('new_york')
[Synset('new_york.n.01'), Synset('new_york.n.02'), Synset('new_york.n.03')]
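
The two directions of this relation (one synset with several names, one name in several synsets) can be sketched with a plain dictionary. This is only a toy model and not how NLTK stores the data. The first entry copies the lemma names shown above; the second entry's names are illustrative and have not been checked against WordNet. `build_word_index` is a hypothetical helper.

```python
from collections import defaultdict

# Toy synset id -> lemma names mapping (second entry is illustrative only).
toy_synsets = {
    "new_york.n.01": ["New_York", "New_York_City", "Greater_New_York"],
    "new_york.n.02": ["New_York", "Empire_State"],
}


def build_word_index(synsets):
    """Map each lowercased lemma name to the synset ids that contain it."""
    index = defaultdict(list)
    for synset_id, lemma_names in synsets.items():
        for name in lemma_names:
            index[name.lower()].append(synset_id)
    return dict(index)


word_index = build_word_index(toy_synsets)
print(len(word_index))         # 4 distinct lowercased names
print(word_index["new_york"])  # ['new_york.n.01', 'new_york.n.02']
```

Counting distinct names (4 here) is therefore not the same as counting name-in-synset pairs (5 here), which is one more reason the answer depends on what you call a "word".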

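Do tokens such as zoom_in count as words?

Again, it depends on the definition of a "word". As in the examples above, if you iterate through wn.words(), you are iterating through lemma names, and the new_york example shows that the lemma name list of a synset can contain multi-word expressions.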
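
If you would rather count multi-word expressions separately, one possible approach is to split the lemma names on whether they contain an underscore. This is a minimal sketch that assumes, as in the new_york and zoom_in examples, that an underscore joins the parts of a multi-word expression. `split_by_multiword` is a hypothetical helper and the sample list is hand-made.

```python
def split_by_multiword(lemma_names):
    """Partition lemma names into single-word entries and multi-word expressions.

    Assumes an underscore joins the parts of a multi-word expression.
    """
    single, multi = [], []
    for name in lemma_names:
        (multi if "_" in name else single).append(name)
    return single, multi


# Hand-made sample, not real WordNet output.
sample = ["dog", "zoom_in", "new_york", "run"]
single, multi = split_by_multiword(sample)
print(single)  # ['dog', 'run']
print(multi)   # ['zoom_in', 'new_york']
```

Passing `wn.words()` instead of `sample` would give the two groups for the whole corpus, so you can report whichever count matches your own definition of a "word".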

【Comments】: