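【Question title】: Understanding the SemCor corpus structure
【Posted】: 2011-01-03 10:27:00
【Question】:

I am learning NLP and am currently experimenting with Word Sense Disambiguation. I plan to use the SemCor corpus as training data, but I cannot make sense of its XML structure. I searched Google and found no resource that describes how SemCor's contents are structured.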

<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
<wf cmd="ignore" pos="DT">an</wf>
<wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf>
<wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf>
<punc>``</punc>
<wf cmd="ignore" pos="DT">no</wf>
<wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf>
<punc>''</punc>
<wf cmd="ignore" pos="IN">that</wf>
<wf cmd="ignore" pos="DT">any</wf>
<wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf>
<wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf>
<punc>.</punc>
</s>
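  • I assume wnsn is the "word sense". Is that right?
  • What does the lexsn attribute mean? How does it map to WordNet?
  • What does the pn attribute refer to? (third line)
  • How is the rdf attribute assigned? (again, third line)
  • In general, what are the possible attributes?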

【Comments】:

  • Did you figure this out? I need to convert this data into a WSD classification task. How should I do that?

Tags: linguistics corpus nlp


【Answer 1】:

The format is described in the file "doc/cxtfile.txt" in the SemCor 1.6 archive; for some reason, later releases do not include the documentation.
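As a rough illustration of how lexsn relates to WordNet: a WordNet sense key has the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id, and the lexsn value appears to be everything after the "%", while wnsn looks like the WordNet sense number for that lemma. The sketch below assumes well-formed XML with quoted attributes, as in the sample from the question, and one sense per token; the function names decode_lexsn and tagged_tokens are made up for this example and are not part of SemCor or any library.

```python
import xml.etree.ElementTree as ET

# A few lines from the sentence in the question (quoted-attribute XML assumed).
SAMPLE = """<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<punc>.</punc>
</s>"""

# ss_type codes as used in WordNet sense keys.
SS_TYPES = {
    "1": "noun",
    "2": "verb",
    "3": "adjective",
    "4": "adverb",
    "5": "adjective satellite",
}


def decode_lexsn(lemma, lexsn):
    """Split a lexsn value into its five sense-key fields (hypothetical helper)."""
    ss_type, lex_filenum, lex_id, head_word, head_id = lexsn.split(":")
    return {
        "sense_key": f"{lemma}%{lexsn}",  # lemma + "%" + lexsn
        "ss_type": SS_TYPES.get(ss_type, ss_type),
        "lex_filenum": lex_filenum,       # WordNet lexicographer file number
        "lex_id": lex_id,
        "head_word": head_word,           # only filled for adjective satellites
        "head_id": head_id,
    }


def tagged_tokens(sentence_xml):
    """Return (surface form, decoded sense) for every sense-tagged <wf>."""
    root = ET.fromstring(sentence_xml)
    result = []
    for wf in root.iter("wf"):
        # cmd="ignore" tokens carry no lemma/lexsn, so skip them.
        if wf.get("cmd") == "done" and wf.get("lexsn"):
            result.append((wf.text, decode_lexsn(wf.get("lemma"), wf.get("lexsn"))))
    return result


if __name__ == "__main__":
    for surface, sense in tagged_tokens(SAMPLE):
        print(surface, sense["sense_key"], sense["ss_type"])
```

If NLTK and its WordNet data are installed, a key built this way can be passed to nltk.corpus.wordnet.lemma_from_key to look up the synset, provided the WordNet version matches the one SemCor was tagged against.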

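【Comments】:

  • wnsn is the "word used" or its "lemmatized form", since the two can differ.
  • The link above is dead. Here is the current SemCor 1.6 archive, and you can download other SemCor versions from here.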
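On the earlier comment asking how to turn this data into a WSD classification task, one possible starting point is to emit one training instance per sense-tagged token: the label is the sense (lemma plus lexsn) and the features are the neighbouring words. This is only a minimal sketch under assumptions: the input is well-formed XML like the sample in the question, the window size is arbitrary, and the function name wsd_instances and the dictionary layout are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sentence from the question.
SAMPLE = """<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
<punc>.</punc>
</s>"""


def wsd_instances(sentence_xml, window=2):
    """Build one (context, label) instance per sense-tagged token."""
    root = ET.fromstring(sentence_xml)
    tokens = [el for el in root if el.tag == "wf"]  # <punc> elements are skipped
    # Use the lemma where one is given, otherwise the lower-cased surface form.
    words = [(t.get("lemma") or t.text).lower() for t in tokens]

    instances = []
    for i, tok in enumerate(tokens):
        if tok.get("cmd") != "done" or not tok.get("lexsn"):
            continue  # untagged token: usable as context, not as a target
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        instances.append({
            "target": tok.get("lemma"),
            "pos": tok.get("pos"),
            "context": left + right,
            "label": f'{tok.get("lemma")}%{tok.get("lexsn")}',
        })
    return instances


if __name__ == "__main__":
    for inst in wsd_instances(SAMPLE):
        print(inst)
```

Grouping the resulting instances by target lemma would give one small classification problem per word, which is a common way to set up supervised WSD; the context lists can then be fed to any bag-of-words style classifier.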