Keep document ID with R corpus: answers

【Question title】: Keep document ID with R corpus
【Posted】: 2014-08-21 11:55:29
【Question】:

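I have searched Stack Overflow and the web, and I can only find partial solutions, or solutions that no longer work because of changes in tm or qdap. The problem is as follows:

I have a data frame with two columns: ID and Text (a simple document id/name and some text).

I have two issues:

Part 1: How can I create a tdm or dtm and maintain the document name/id? It only shows "character(0)" on inspect(tdm).
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the tdm/dtm.

For part 2, I used a solution I got here: How to implement proximity rules in tm dictionary for counting words?

But that one happens at the tdm stage! Is there a better solution for part 2, where you could use something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help would be greatly appreciated. Thanks a lot!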

【Comments】:

Tags: r text text-mining tm corpus

【Answer 1】:

First, here is a sample data.frame:

dd<-data.frame(
    id=10:13,
    text=c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale")
    ,stringsAsFactors=F
 )

Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader. This is all we need to do:

library(tm)
myReader <- readTabular(mapping=list(content="text", id="id"))

We just specify which column of the data.frame to use for the content and which for the id. Now we read it in with DataframeSource, but using our custom reader.

tm <- VCorpus(DataframeSource(dd), readerControl=list(reader=myReader))

Now, if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is:

keepOnlyWords<-content_transformer(function(x,words) {
    regmatches(x, 
        gregexpr(paste0("\\b(",  paste(words,collapse="|"),"\\b)"), x)
    , invert=T)<-" "
    x
})

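This replaces everything that is not in the word list with a space. Note that you may want to run stripWhitespace after this. Thus our transformations would look like: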

keep<-c("wonder","then","that","the")

tm<-tm_map(tm, content_transformer(tolower))
tm<-tm_map(tm, keepOnlyWords, keep)
tm<-tm_map(tm, stripWhitespace)
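
As an aside on the pattern used in keepOnlyWords: the closing `\\b` sits inside the group, so it appears to constrain only the last word in the list. The following is a minimal base R sketch, not part of the original answer, that puts a word boundary on both sides of the group and uses `perl = TRUE`. The name `keep_only_words` and the two sample strings are hypothetical, chosen only for illustration.

```r
# Hypothetical helper, not part of the original answer: blank out everything
# in `x` except whole-word occurrences of the entries in `words`.
keep_only_words <- function(x, words) {
  # The closing \\b is outside the group, so every alternative must end
  # at a word boundary (e.g. "the" will not match inside "there").
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, x, perl = TRUE)
  # invert = TRUE targets the text between matches; each stretch
  # of non-matching text is replaced by a single space.
  regmatches(x, hits, invert = TRUE) <- " "
  x
}

keep <- c("wonder", "then", "that", "the")
kept <- keep_only_words(
  c("no wonder, then, that ever", "but there were other"),
  keep
)
print(trimws(kept))
```

To use this inside a tm pipeline, the function could presumably be wrapped with content_transformer() in the same way as keepOnlyWords above.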

Then we can turn it into a document-term matrix:

dtm<-DocumentTermMatrix(tm)
inspect(dtm)

# <<DocumentTermMatrix (documents: 4, terms: 4)>>
# Non-/sparse entries: 7/9
# Sparsity           : 56%
# Maximal term length: 6
# Weighting          : term frequency (tf)

#     Terms
# Docs that the then wonder
#   10    1   1    1      1
#   11    2   0    0      0
#   12    0   1    0      0
#   13    0   3    0      0

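It has our list of words and the proper document IDs from the data.frame.

【Comments】:

Great stuff! A dream come true!
But how do I turn this into an ID-mapped data frame so that we can use it for other experiments?
Outdated answer; readTabular no longer exists.

【Answer 2】:

In newer versions of tm, this is much easier with the DataframeSource() function.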

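"A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata."

So in this case: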

dd <-data.frame(
    doc_id=10:13,
    text=c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale")
    ,stringsAsFactors=F
 )

Corpus = VCorpus(DataframeSource(dd))
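
One of the comments above asks how to get an ID-mapped data frame out of this. A minimal sketch follows, assuming a tm version whose DataframeSource() follows the doc_id/text convention quoted above. The variable names `corpus`, `m` and `dtm_df` are illustrative, and `doc_id` is stored as character here to match the "string identifier" wording.

```r
library(tm)

# Same sample data as above, with doc_id stored as character.
dd <- data.frame(
  doc_id = as.character(10:13),
  text = c(
    "No wonder, then, that ever gathering volume from the mere transit ",
    "So that in many cases such a panic did he finally strike, that few ",
    "But there were still other and more vital practical influences at work",
    "Not even at the present day has the original prestige of the Sperm Whale"
  ),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(dd))
dtm <- DocumentTermMatrix(corpus)

# Dense matrix: rows are documents (named by doc_id), columns are terms.
m <- as.matrix(dtm)

# Put the document IDs in an explicit column next to the term counts.
# check.names = FALSE keeps the terms unchanged as column names.
dtm_df <- data.frame(
  doc_id = rownames(m), m,
  check.names = FALSE, stringsAsFactors = FALSE
)
```

The resulting `dtm_df` has one row per document, so it can be joined back to `dd` on `doc_id`. Converting to a dense matrix is only sensible for small corpora.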

【Comments】: