[Question title]: R: removal of regex from Quanteda DFM, Sparse Document-Feature Matrix, object?
[Posted]: 2017-05-25 23:42:57
[Question description]:

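The Quanteda package provides the sparse document-feature matrix (DFM), and its methods include removeFeatures. I tried dfm(x, removeFeatures="\\b[a-z]{1-3}\\b") to remove words that are too short, and dfm(x, keptFeatures="\\b[a-z]{4-99}\\b") to keep words that are long enough. Both are meant to do the same thing, namely drop the short words, but neither works.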

How can I remove features that match a regular expression from a Quanteda DFM object?

Example:

myMatrix <-dfm(myData, ignoredFeatures = stopwords("english"), 
           stem = TRUE, toLower = TRUE, removeNumbers = TRUE, 
           removePunct = TRUE, removeSeparators = TRUE, language = "english")
#
#How to use keptFeatures/removeFeatures here?


#Instead of RemoveFeatures/keptFeatures methods, I tried it like this but not working
x<-unique(gsub("\\b[a-zA-Z0-9]{1,3}\\b", "", colnames(myMatrix))); 
x<-x[x!=""]; 
mmyMatrix<-myMatrix; 
colnames(mmyMatrix) <- x

Example DFM

myData <- c("a aothu oat hoah huh huh huhhh h h h n", "hello h a b c d abc abcde", "hello hallo hei hej", "Hello my name is hhh.")
myMatrix <- dfm(myData)

[Comments]:

  • Possibly something like dfm_select(myMatrix, "^[[:alnum:]]{1,3}$", "remove", valuetype = "regex")

Tags: r regex matrix sparse-matrix quanteda


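[Solution 1]:

The function is dfm_select, available in quanteda >= v0.9.9. The call below uses selection = "keep", so it shows which features the pattern matches: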

myMatrix
## Document-feature matrix of: 4 documents, 22 features (70.5% sparse).

dfm_select(myMatrix, "\\b[a-zA-Z0-9]{1,3}\\b", selection = "keep", valuetype = "regex")
## kept 14 features, from 1 supplied (regex) feature types
## Document-feature matrix of: 4 documents, 14 features (71.4% sparse).
## 4 x 14 sparse Matrix of class "dfmSparse"
##        features
## docs    a oat huh h n b c d abc hei hej my is hhh
##   text1 1   1   2 3 1 0 0 0   0   0   0  0  0   0
##   text2 1   0   0 1 0 1 1 1   1   0   0  0  0   0
##   text3 0   0   0 0 0 0 0 0   0   1   1  0  0   0
##   text4 0   0   0 0 0 0 0 0   0   0   0  1  1   1
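
The output above keeps the short features, whereas the question asks to drop them. A minimal base-R sketch of the underlying idea follows: test each feature name against an anchored pattern, then subset the columns instead of renaming them. The vector `feats` and the plain matrix `m` are hypothetical stand-ins for `colnames(myMatrix)` and the dfm itself, and the anchored pattern `^[a-zA-Z0-9]{1,3}$` is an assumption chosen to match whole feature names.

```r
# Hypothetical stand-in for colnames(myMatrix): the 22 features of the example
# DFM (lower-cased words plus the "." token).
feats <- c("a", "aothu", "oat", "hoah", "huh", "huhhh", "h", "n",
           "hello", "b", "c", "d", "abc", "abcde", "hallo", "hei", "hej",
           "my", "name", "is", "hhh", ".")

# TRUE for features made of 1 to 3 alphanumeric characters.
# Note the quantifier syntax {1,3}; {1-3} is not a valid range.
short <- grepl("^[a-zA-Z0-9]{1,3}$", feats)

# Plain matrix standing in for the dfm (all counts zero; only the shape matters).
m <- matrix(0L, nrow = 4, ncol = length(feats),
            dimnames = list(paste0("text", 1:4), feats))

# Drop the matching columns by logical indexing. Assigning a shorter vector to
# colnames() (as in the question's attempt) does not remove any columns.
m_long <- m[, !short, drop = FALSE]

print(feats[short])      # the 14 short features kept in the answer's output
print(colnames(m_long))  # the features that remain after removal

# With quanteda installed, the comment in the thread suggests the equivalent:
# dfm_select(myMatrix, "^[[:alnum:]]{1,3}$", selection = "remove", valuetype = "regex")
```

The same logical vector can be reused to check a pattern before passing it to dfm_select, which makes it easy to see whether a regex selects the intended features.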
