Split up ngrams in document-feature matrix (quanteda)

【Question title】: Split up ngrams in document-feature matrix (quanteda)
【Posted】: 2017-05-24 12:48:48
【Question】:

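I was wondering whether it is possible to split up ngram features in a document-feature matrix (dfm), so that, for example, a bigram results in two separate unigrams?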

head(dfm, n = 3, nfeature = 4)

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So the dfm above would turn into something like this:

head(dfm, n = 3, nfeature = 4)

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

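For context: the ngrams in the dfm come from translating the features from German into English. Compounds ("Emissionsminderung") are quite common in German, but in English the same concept is usually written as separate words ("emission reduction").

Thanks in advance!

Edit: The following can be used as a reproducible example.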

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

head(eg.dfm)

【Comments】:

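If you have two bigrams containing the same word, e.g. "emission_reduction" and "emission_increase", should the counts in the columns be summed for the shared word ("emission" in the example)? Disclaimer: I'm no expert here, so maybe I'm saying something that makes no sense...
Yes. Say a document contains the bigram "emission_reduction" twice and "emission_increase" once; the result should then be 3 for "emission", 2 for "reduction" and 1 for "increase". If, for example, "increase" is also present as a unigram feature, the total for "increase" should be 2.
Unfortunately I don't know the dfm format, and I don't know whether it works like a data.frame... could you post a reproducible data sample (e.g. by posting dput(head(dfm)))?
Sure! Please find an example in the original question above. Thanks!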
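
The counting rule agreed on in the comments above can be checked on a single document with a few lines of base R. This is only a minimal sketch: the named vector `feature_counts` is a made-up stand-in for one row of a dfm, not a quanteda object.

```r
# Hypothetical counts for one document (assumption: one dfm row as a named vector)
feature_counts <- c(emission_reduction = 2, emission_increase = 1, increase = 1)

# Split every feature name on "_" to get its unigrams
parts <- strsplit(names(feature_counts), "_", fixed = TRUE)

# Repeat each count once per unigram contained in the feature name
expanded <- rep(feature_counts, lengths(parts))

# Sum the repeated counts per unigram
unigram_counts <- tapply(expanded, unlist(parts), sum)
unigram_counts
# emission 3, increase 2, reduction 2
```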

Tags: r quanteda

【Solution 1】:

I don't know whether this is the best approach (it can use a lot of RAM, because it turns the sparse dfm into a dense data.frame/matrix), but it should work:

# turn the dfm into a data.frame, then into a transposed matrix
# (this densifies the sparse dfm, so it may use a lot of RAM)
DF <- as.data.frame(eg.dfm)
MX <- t(DF)
# split the current column names on '_'
colsSplit <- strsplit(colnames(DF), '_')
# replicate each row once per unigram in its name and assign the split names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), ]
rownames(MX) <- unlist(colsSplit)
# sum the rows sharing the same name and transpose back
MX2 <- t(do.call(rbind, by(MX, rownames(MX), colSums)))
# turn the matrix back into a dfm
eg.dfm.res <- as.dfm(MX2)

Result:

> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1

【Comments】:

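It seems to work fine if I add DF <- as.data.frame(eg.dfm) at the beginning. Is that right?
Great! I think this is a nice workaround using a data frame. Thanks for your help!
See stackoverflow.com/questions/44538939/… for a better solution that preserves sparsity.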
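
To give an idea of what staying sparse could look like, here is a minimal sketch using only the Matrix package. It is not the solution from the linked question, and the toy matrix `x` is a hand-built stand-in for `eg.dfm`, so treat it as an assumption-laden illustration. The idea is to build a sparse feature-by-unigram mapping matrix and multiply the document-feature counts by it.

```r
library(Matrix)

# Toy stand-in for eg.dfm (assumption: counts typed in by hand as a sparse matrix)
x <- Matrix(
  c(1, 1, 1, 1, 0, 0,
    0, 0, 1, 1, 1, 0,
    1, 1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(
    c("text1", "text2", "text3"),
    c("increase", "in_the", "great", "plenary",
      "emission_reduction", "emission_increase")
  )
)

# Split the feature names and collect the distinct unigrams
parts <- strsplit(colnames(x), "_", fixed = TRUE)
unigrams <- sort(unique(unlist(parts)))

# Sparse mapping: entry (i, j) counts how often unigram j occurs in feature i
# (sparseMatrix() sums repeated (i, j) pairs)
map <- sparseMatrix(
  i = rep(seq_along(parts), lengths(parts)),
  j = match(unlist(parts), unigrams),
  x = 1,
  dims = c(length(parts), length(unigrams)),
  dimnames = list(colnames(x), unigrams)
)

# documents x features  %*%  features x unigrams  ->  documents x unigrams
res <- x %*% map
res
```

On the toy data this gives the same counts as the result shown in the answer, without ever building a dense data.frame. With a real dfm, the product could presumably be passed to as.dfm() afterwards, but that step is not verified here.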