【Question title】: English dictionary based word count in R
【Posted】: 2019-08-20 21:54:09
【Question】:

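I am trying to do some text analysis, and I would like to know whether there is any tool or package that can recognize the different forms of an English word (e.g. singular, plural, past, present) and return the word counts.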

In this character vector, myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"), I would like to get a count of 4 for the word Fire and a count of 5 for the word Hospital.

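【Comments】:

  • @r2evans That would give fired fires firing hospital hospitals Hospitals 2 1 1 1 1 1
  • MAPK, I deleted it, apparently not fast enough :-). Have you tried the NLP package? If the package itself is not sufficient, its revdeps might offer leads for going further.

Tags: r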


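【Solution 1】:

Look into stemming techniques.

Stemming is the process of reducing inflected (or sometimes derived) words to their root form. (For example, "close" would be the root of "closed", "closing", "closes", etc.)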

install.packages("tm")
library(tm)

mydf <- data.frame(doc_id = 1:9, 
                    text = c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized"), 
                    stringsAsFactors = FALSE)

mycorpus <- SimpleCorpus(DataframeSource(mydf))

mytmmap <- tm_map(mycorpus, stemDocument, language = "english")  

inspect(mycorpus)

inspect(mytmmap)

# <<SimpleCorpus>>
# Metadata:  corpus specific: 1, document level (indexed): 0
# Content:  documents: 9
#
#     1      2      3      4      5      6      7      8      9 
#  fire   fire   fire   fire hospit Hospit hospit hospit Hospit 
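
The stemmed corpus above still separates "hospit" from "Hospit", and it does not yet produce the counts asked for in the question. A minimal sketch of the counting step follows. It assumes the SnowballC package is installed (as far as I know, this is the stemmer that tm's stemDocument relies on) and lowercases the words before stemming. The names stems and counts are only illustrative.

```r
# Assumption: SnowballC is installed; it provides the Snowball (Porter) stemmer.
library(SnowballC)

myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals",
           "hospital", "hospitalization", "Hospitalized")

# Lowercase first so that "Hospitals" and "hospitals" end up with the same stem,
# then reduce every word to its stem.
stems <- wordStem(tolower(myvec), language = "english")

# Tabulate how many of the original words share each stem.
counts <- table(stems)
counts
```

Going by the stems shown in the output above, this should group the words into fire (4) and hospit (5). The stem names can then be mapped back to display labels such as "Fire" and "Hospital" if needed.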

【Comments】:

    【Solution 2】:

    A better option would be stringdist, but this will work

    f1 <- function(patVec, vec, nameVec) {
           out <- colSums(sapply(patVec, agrepl, x = vec,
                 max.distance = 0.1, ignore.case = TRUE))
           names(out) <- nameVec
           out
        }
            
    o1 <-  f1(c("fire", "hospital"), myvec, c("Fire", "Hospital"))
              
    o1
    #    Fire Hospital 
    #       4        3 
    

    For the second vector

    o1 <- f1(c("fire", "hospital"), myvec2, c("Fire", "Hospital"))
    o1
    #    Fire Hospital 
    #      4        5 
    

    Or using soundex

    library(phonics)
    o2 <- table(substr(soundex(myvec), 1, 2))
    names(o2) <- c("Fire", "Hospital")
    o2
    #   Fire Hospital 
    #      4        3 
    

    For the second vector

    o2 <- table(substr(soundex(myvec2), 1, 2))
    names(o2) <- c("Fire", "Hospital")
    o2
    #    Fire Hospital 
    #       4        5 
    

    All of these methods give the expected output from the OP's post

    Data

    myvec <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital")
    myvec2 <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")
    

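    【Comments】:

    • It doesn't work for different kinds of words. Only for (e|i)
    • @MAPK Based on the example you showed, agrep (modified) and soundex both work. If you have a new vector, please update your post. Without that, the downvote seems unfair
    【Solution 3】:

    A stemming example using the Quanteda library. https://quanteda.io/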

    install.packages("quanteda")
    
    library(quanteda)
    
    mytext = c("fired", "fires", "firing", "fired", "hospitals", "Hospitals", "hospital", "hospitalization", "Hospitalized")
    
    mytoks <- tokens(mytext)
    
    toks_stem <- tokens_wordstem(mytoks, "english")
    # tokens from 9 documents.
    #[1] "fire",  "fire", "fire", "fire", "hospit", "Hospit", "hospit", "hospit", "Hospit"
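
To get from the stemmed tokens to the counts asked for in the question, the stems can be tabulated with a document-feature matrix. This is a minimal sketch, assuming a recent quanteda version in which dfm() accepts a tokens object. The names stem_dfm and counts are only illustrative, and the tokens are lowercased explicitly so that "Hospit" and "hospit" are counted together.

```r
# Assumption: quanteda is installed and dfm() accepts a tokens object.
library(quanteda)

mytext <- c("fired", "fires", "firing", "fired", "hospitals", "Hospitals",
            "hospital", "hospitalization", "Hospitalized")

# Tokenize, lowercase, then stem each token.
toks_stem <- tokens_wordstem(tokens_tolower(tokens(mytext)), language = "english")

# Build a document-feature matrix: one row per document, one column per stem.
stem_dfm <- dfm(toks_stem)

# Column sums give the total number of occurrences of each stem.
counts <- colSums(stem_dfm)
counts
```

Going by the stems shown above, this should report 4 for fire and 5 for hospit.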
    

    Quanteda cheat sheet - https://github.com/rstudio/cheatsheets/blob/master/quanteda.pdf

    【Comments】:
