How to add/subtract document-term matrices in quanteda? Answers

[Question title]: How to add/subtract document-term matrices in quanteda?
[Posted]: 2019-05-19 02:06:26
[Question description]:

Consider this simple example

library(quanteda)
library(dplyr)  # provides tibble() and the %>% pipe

dfm1 <- tibble(text = c('hello world',
                        'hello quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm1
Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
2 x 3 sparse Matrix of class "dfm"
       features
docs    hello world quanteda
  text1     1     1        0
  text2     1     0        1

and

dfm2 <- tibble(text = c('hello world',
                        'good night quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm2
Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
2 x 5 sparse Matrix of class "dfm"
       features
docs    hello world good night quanteda
  text1     1     1    0     0        0
  text2     0     0    1     1        1

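As you can see, both dfms have the same text identifiers: text1 and text2.

I would like to "subtract" dfm2 from dfm1, so that each entry in dfm1 is reduced by its (possibly) matching entry in dfm2 (same text, same word).

For example, in text1, hello occurs 1 time in dfm1 and also 1 time in dfm2. So the output for that entry should be 0 (that is: 1 - 1). Of course, entries that are not in both dfms should be left unchanged.

How can I do that in quanteda?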

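[Question comments]:

Do you want to subtract the counts of matching features in dfm2 from dfm1? Or those in dfm1 from dfm2?
The order does not really matter. I want to add or subtract matching features across the dfms. Does that make sense?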

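Tags: r sparse-matrix quanteda

[Solution 1]:

You can use dfm_match() to match the feature set of one dfm to that of another dfm. I have also tidied up your code, since for this short example some of your pipes can be simplified.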

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dfm1 <- dfm(c("hello world", "hello quanteda"))
dfm2 <- dfm(c("hello world", "good night quanteda"))

as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))
## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    hello world quanteda
##   text1     0     0        0
##   text2     1     0        0

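The as.dfm() is needed because arithmetic operators such as + and - are defined for the parent sparse Matrix class, not specifically for a quanteda dfm, so the operation drops the dfm class and returns a dgCMatrix. Coercing the result back with as.dfm() solves that, but it drops the original attributes of the dfm objects, such as docvars.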

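[Comments]:

Actually, I now realize something more interesting. One can simply pass the union of the features (features(dfm1) UNION features(dfm2)) to the features argument, so that the output dfm contains all the words used in the dfms (not just the common ones).
To combine the features of the two dfms, you could also use dfm_compress(cbind(dfm1, dfm2), margin = "features").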
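
The two comments above can be put together in a short sketch. This is only an illustration under some assumptions: both dfms are matched to the union of their feature names (matching only dfm2 would leave the two matrices with different dimensions), the variable names all_feats, m1, m2, dfm_diff, dfm_sum and dfm_sum2 are made up for this example, and the dfms are built through tokens() because recent quanteda releases expect tokens rather than a character vector as input to dfm(). The cbind() call warns about overlapping features, which is expected here, so the warning is suppressed.

```r
library("quanteda")

# Build the dfms from tokens objects
dfm1 <- dfm(tokens(c("hello world", "hello quanteda")))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Union of both feature sets; dfm_match() pads features that are
# missing from a dfm with zero counts
all_feats <- union(featnames(dfm1), featnames(dfm2))
m1 <- dfm_match(dfm1, features = all_feats)
m2 <- dfm_match(dfm2, features = all_feats)

# Element-wise difference and sum over identical dimensions,
# coerced back to dfm (docvars are not carried over)
dfm_diff <- as.dfm(m1 - m2)
dfm_sum  <- as.dfm(m1 + m2)

# Alternative for the sum only: bind the columns, then merge
# features that share a name by adding their counts
dfm_sum2 <- dfm_compress(suppressWarnings(cbind(dfm1, dfm2)),
                         margin = "features")

print(dfm_diff)
print(dfm_sum)
```

With the union, features that occur only in dfm2 (good, night) end up with negative counts in dfm_diff, so the result is no longer a plain count matrix. The column order of dfm_sum2 may differ from that of dfm_sum, so compare the two by feature name rather than by position.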