Reading and writing POS tagged sentences from text files using NLTK and Python

【Question title】: Reading and writing POS tagged sentences from text files using NLTK and Python
【Posted】: 2011-07-31 06:22:23
【Question】:

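Does anyone know of an existing module or a simple way to read and write part-of-speech tagged sentences to and from text files? I'm using Python and the Natural Language Toolkit (NLTK). For example, this code: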

import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

print(tagged)

returns this nested list:

[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')], [('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('-', ':'), ('never', 'RB'), ('mind', 'VBP'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('-', ':'), ('having', 'VBG'), ('little', 'RB'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'RB'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.')]]

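I know I could easily dump this into a pickle, but I really want to export it as part of a larger text file. I'd like to be able to export the list to a text file, come back to it later, parse it, and recover the original list structure. Are there any built-in functions in NLTK for doing this? I've looked but couldn't find any...

Example output: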

<headline>Article headline</headline>
<body>Call me Ishmael...</body>
<pos_tags>[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP')...</pos_tags>

【Comments】:

Tags: python nlp text-files nltk

【Answer 1】:

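NLTK has a standard file format for tagged text. It looks like this:

Call/NNP me/PRP Ishmael/NNP ./.

You should use this format, because it lets you read your files with NLTK's TaggedCorpusReader and other similar classes, and get the full range of corpus reader functions. Confusingly, NLTK has no high-level function for writing a tagged corpus in this format, but that's probably because it's trivial to do: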

for sent in tagged:
    print(" ".join(word + "/" + tag for word, tag in sent))

(NLTK does provide nltk.tag.tuple2str(), but it only handles one word; it's simpler to just type word+"/"+tag.)
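To make the writing step concrete, here is a minimal sketch that saves the sentences one per line and parses them back by hand. The file name tagged.txt and the two helper functions are made up for illustration; they are not part of NLTK, and the hand-written parser is only a stand-in to show that the format round-trips.

```python
import os
import tempfile

# A small hand-written sample in the same shape that nltk.pos_tag() returns.
tagged = [
    [("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")],
    [("my", "PRP$"), ("purse", "NN"), (",", ",")],
]


def format_tagged_sentence(sent, sep="/"):
    """Hypothetical helper: join (word, tag) pairs as word/tag tokens."""
    return " ".join(word + sep + tag for word, tag in sent)


def parse_tagged_sentence(line, sep="/"):
    """Hypothetical helper: split on the last separator of each token."""
    return [tuple(token.rsplit(sep, 1)) for token in line.split()]


# Write one sentence per line (the path is an assumption for this example).
path = os.path.join(tempfile.mkdtemp(), "tagged.txt")
with open(path, "w", encoding="utf-8") as f:
    for sent in tagged:
        f.write(format_tagged_sentence(sent) + "\n")

# Read the lines back and rebuild the nested list.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
restored = [parse_tagged_sentence(line) for line in lines]

print(lines[0])
print(restored == tagged)
```

Splitting on the last separator keeps a token like ./. intact and also tolerates words that themselves contain a slash.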

If you save your tagged text in this format in one or more files fileN.txt, you can read it back with nltk.corpus.reader.TaggedCorpusReader like this:

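# The second argument is a regular expression matched against file names,
# so a raw string keeps the backslash intact.
mycorpus = nltk.corpus.reader.TaggedCorpusReader("path/to/corpus", r"file.*\.txt")
print(mycorpus.fileids())
print(mycorpus.sents()[0])
for sent in mycorpus.tagged_sents():
    ...  # process each tagged sentence here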

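Note that the sents() method gives you the untagged text, albeit a bit oddly spaced. There's no need to include both the tagged and untagged versions in the file, as in your example.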
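As a rough end-to-end check of the round trip, the sketch below writes two sentences to a temporary directory and reads them back with TaggedCorpusReader. The directory, the file name file1.txt, and the sample data are assumptions for the example; it relies on the reader's default settings (one sentence per line, whitespace-separated word/tag tokens).

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Sample data in the same shape that nltk.pos_tag() returns.
tagged = [
    [("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")],
    [("I", "PRP"), ("thought", "VBD"), ("I", "PRP"), ("would", "MD"),
     ("sail", "VB"), (".", ".")],
]

# Write one sentence per line into a throwaway corpus directory.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "file1.txt"), "w", encoding="utf-8") as f:
    for sent in tagged:
        f.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")

# Read everything back; the pattern is a regex over file names.
reader = TaggedCorpusReader(corpus_dir, r"file.*\.txt")
restored = [list(sent) for sent in reader.tagged_sents()]
first_words = list(reader.sents()[0])

print(reader.fileids())
print(restored == tagged)
print(first_words)
```

The untagged view comes from the same file, which is why a separate untagged copy of the text is not needed.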

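TaggedCorpusReader does not support file headers (for the headline, etc.), but if you really need them, you can derive your own class that reads the file metadata and then handles the rest like TaggedCorpusReader.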
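One possible way to handle a header, sketched in plain Python rather than as a TaggedCorpusReader subclass: this assumes a made-up layout in which "key: value" header lines are separated from the tagged sentences by a blank line. The layout and the parse_record function are illustrative assumptions, not an NLTK convention.

```python
def parse_record(text, sep="/"):
    """Hypothetical parser: header lines, a blank line, then word/tag lines."""
    header, _, body = text.partition("\n\n")

    # Header: one "key: value" pair per line.
    metadata = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()

    # Body: one tagged sentence per line, split on the last separator.
    tagged_sents = [
        [tuple(token.rsplit(sep, 1)) for token in line.split()]
        for line in body.splitlines()
        if line.strip()
    ]
    return metadata, tagged_sents


record = (
    "headline: Article headline\n"
    "\n"
    "Call/NNP me/PRP Ishmael/NNP ./.\n"
    "I/PRP thought/VBD I/PRP would/MD sail/VB ./.\n"
)

metadata, tagged_sents = parse_record(record)
print(metadata)
print(tagged_sents[0])
```

A subclass could do the same split first and hand only the body to the normal tagged-text handling.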

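【Comments】:

Minor error: you need to create a list inside the join call: print " ".join([word+"/"+tag for word, tag in sent])
@RahulJha, why? Try what I wrote. It's called a generator, and it works without building the result list ahead of time (great for very long lists, but it works everywhere).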

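【Answer 2】:

It seems that using pickle.dumps and inserting its output into your text file, perhaps with a tag wrapper for automated loading, would satisfy your requirements.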
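A rough sketch of that idea, with some assumptions: pickle.dumps returns bytes, so the example base64-encodes them to keep the file plain text, and it pulls the payload back out with a regular expression. The tag name pos_tags comes from the question; the rest is illustrative.

```python
import base64
import pickle
import re

tagged = [[("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]]

# Pickle to bytes, then base64-encode so the payload is safe inside text.
payload = base64.b64encode(pickle.dumps(tagged)).decode("ascii")
document = (
    "<headline>Article headline</headline>\n"
    "<pos_tags>" + payload + "</pos_tags>\n"
)

# Later: find the wrapped payload, decode it, and unpickle.
match = re.search(r"<pos_tags>(.*?)</pos_tags>", document, re.DOTALL)
restored = pickle.loads(base64.b64decode(match.group(1)))

print(restored == tagged)
```

The payload is not human-readable, and pickle.loads should only be used on files you trust.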

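Can you be more specific about what you would like the text output to look like? Are you aiming for something more human-readable?

EDIT: Adding some code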

from xml.dom.minidom import Document, parseString
import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

# Write to xml string
doc = Document()

base = doc.createElement("Document")
doc.appendChild(base)

headline = doc.createElement("headline")
htext = doc.createTextNode("Article Headline")
headline.appendChild(htext)
base.appendChild(headline)

body = doc.createElement("body")
btext = doc.createTextNode(sentences)
body.appendChild(btext)  # attach the body text to <body>, not <headline>
base.appendChild(body)

pos_tags = doc.createElement("pos_tags")
tagtext = doc.createTextNode(repr(tagged))
pos_tags.appendChild(tagtext)
base.appendChild(pos_tags)

xmlstring = doc.toxml()

# Read back tagged

doc2 = parseString(xmlstring)
el = doc2.getElementsByTagName("pos_tags")[0]
text = el.firstChild.nodeValue
tagged2 = eval(text)

print("Equal? {}".format(tagged == tagged2))

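【Comments】:

Thanks. Yes, if possible, I'd like it to be human-readable. I'm extracting data from newspaper articles and creating tagged records. I'd like one of the fields to contain the POS-tagged sentences from the article. See the edit above for an example of the ideal output...
It looks like your desired output is the same as the Python repr of your list?
Yes, but once I turn it into a string with repr(), is there a way to convert it back into a list?
Future readers: there's nothing wrong with this code, but it's not the best approach in NLTK. Please see my answer.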
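On the repr() question in the comments: the code in the answer uses eval() for this. A more cautious option, offered here as a suggestion rather than as part of the original answer, is ast.literal_eval from the standard library, which only accepts Python literals (lists, tuples, strings, numbers and so on) and does not execute arbitrary code.

```python
import ast

tagged = [
    [("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")],
    [("my", "PRP$"), ("purse", "NN"), (",", ",")],
]

# Turn the nested list into its textual representation...
text = repr(tagged)

# ...and parse that text back into lists of (word, tag) tuples.
restored = ast.literal_eval(text)

print(restored == tagged)
```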