[Title]: how to replace lemma in corpus obtained from wordnet in R
[Posted]: 2016-09-19 11:11:26
[Question]:

I used the wordnet library in R and was able to get lemmas for a corpus; below is the code I used.

library(tm)
library(SnowballC)  # stemDocument() below relies on SnowballC's stemmer

doc1 <- "Stray cats are running all over the place. I see 10 a day!"
doc2 <- "Cats are killers. They kill billions of animals a year."
doc3 <- "The best food in Columbus, OH is   the North Market."
doc4 <- "Brand A is the best tasting cat food around. Your cat will love it."
doc5 <- "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats."
doc6 <- "The Arnold Classic came to town this weekend. It reminds us to be healthy."
doc7 <- "I have nothing to say. In summary, I have told you nothing."


doc.list <- list(doc1, doc2, doc3, doc4, doc5, doc6, doc7)

N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", c(1:N.docs))

query <- "Healthy cat food"

my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")

my.corpus <- Corpus(my.docs)
my.corpus

my.corpus <- tm_map(my.corpus, content_transformer(tolower))

#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
my.corpus <- tm_map(my.corpus, toSpace, "/")
my.corpus <- tm_map(my.corpus, toSpace, "-")
my.corpus <- tm_map(my.corpus, toSpace, ":")
my.corpus <- tm_map(my.corpus, toSpace, ";")
my.corpus <- tm_map(my.corpus, toSpace, "@")
my.corpus <- tm_map(my.corpus, toSpace, "\\(" )
my.corpus <- tm_map(my.corpus, toSpace, "\\)")
my.corpus <- tm_map(my.corpus, toSpace, ",")
my.corpus <- tm_map(my.corpus, toSpace, "_")
my.corpus <- tm_map(my.corpus, content_transformer(removeSpecialChars))
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, removeWords, stopwords("en"))
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, removeWords, c("status","please","need","mail",
                                              "email","unable","re","fw","st","th","sep","nov","thank","kmmvlkm","prb"))

#Stem document
my.corpus <- tm_map(my.corpus,stemDocument)

library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")
initDict("C:/Program Files (x86)/WordNet/2.1/dict")

# For each document, look up every token in WordNet and collect the
# lemmas of the matching noun index terms.
lapply(my.corpus, function(x) {
  sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
    x.filter <- getTermFilter("StartsWithFilter", word, TRUE)
    terms    <- getIndexTerms("NOUN", 1, x.filter)
    if (!is.null(terms)) sapply(terms, getLemma)
  })
})

Now I want to replace the words in the corpus with the lemmas obtained from the wordnet library. If anyone knows how to achieve this, please share; it would be a great help.
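For reference, the word-for-lemma substitution being asked about can be sketched with a plain lookup table; `lemma_map` below is a hypothetical named vector standing in for the WordNet results, which in practice would be assembled from the `getLemma()` output above.

```r
# Hypothetical lemma map: names are surface forms, values are lemmas.
lemma_map <- c(cats = "cat", running = "run", killers = "killer")

# Replace each token by its lemma when the map has one; keep it otherwise.
lemmatize <- function(text, map) {
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  hit <- tokens %in% names(map)
  tokens[hit] <- map[tokens[hit]]
  paste(tokens, collapse = " ")
}

lemmatize("stray cats are running", lemma_map)  # "stray cat are run"
# Applied corpus-wide with tm:
# my.corpus <- tm_map(my.corpus, content_transformer(function(x) lemmatize(x, lemma_map)))
```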

[Discussion]:

Tags: r text-analysis lemmatization


[Solution 1]:

Try this:

    Output <- lapply(my.corpus, function(x) {
      sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
        x.filter <- getTermFilter("StartsWithFilter", word, TRUE)
        terms    <- getIndexTerms("NOUN", 1, x.filter)
        if (!is.null(terms)) sapply(terms, getLemma)
      })
    })
    

`Output` should be a list. Convert that list into a corpus:

    crps <- as.VCorpus(Output)
    

Then convert `crps` to a DTM. Hope this helps.
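A minimal sketch of those last two steps, using a dummy stand-in for the `Output` list of per-document lemma vectors so it runs without WordNet:

```r
library(tm)

# Dummy stand-in for the list of per-document lemma vectors built above.
Output <- list(doc1 = c("cat", "run"), doc2 = c("cat", "kill"))

# Collapse each document's lemmas back into one string, rebuild a
# corpus from the strings, then build the document-term matrix.
docs <- vapply(Output, paste, character(1), collapse = " ")
crps <- VCorpus(VectorSource(docs))
dtm  <- DocumentTermMatrix(crps)
inspect(dtm)
```

Depending on the tm version, `as.VCorpus()` on a plain list of character vectors may not yield documents that `DocumentTermMatrix()` accepts directly; rebuilding the corpus from pasted strings, as above, avoids that.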

[Comments]:
