【Question title】: NLP - identifying and replacing words (synonyms) in R
【Posted】: 2017-02-21 08:07:41
【Question description】:

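I have a problem with some code in R.

I have a dataset (questions) with 4 columns and more than 600k observations; one of the columns is named "V3". This column holds questions such as "What is the day today?". I have a second dataset (voc) with 2 columns, one named "word" and the other named "synonyms". If a word from the "synonyms" column of the second dataset (voc) appears in my first dataset (questions), I want to replace it with the corresponding word from the "word" column.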

questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)  

                      V3
1 What is the day today?
2     Tom has brown eyes

voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)

     word synonyms
1 weather      day
2       a      the
3    blue    brown

Desired output

                      V3                        V5
1 What is the day today?  What is a weather today?
2     Tom has brown eyes         Tom has blue eyes

I wrote some simple code, but it doesn't work.

for (k in 1:nrow(question))
{
    for (i in 1:nrow(voc))
   {
      question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3)
   }
}

Maybe someone can help me? :)

I wrote a second version, but it doesn't work either.

for( i in 1:nrow(questions))
{
    for( j in 1:nrow(voc))
      {
        if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE)
        {
            new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,]))
            questions[i,]=new   
        }
    }
    questions = cbind(questions,c(new))
}

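【Comments】:

  • Your question is unlikely to attract answers as it stands. Please provide some sample data (the first few rows of the data frames involved); an example of the desired output would be good too.
  • OK! :) Thanks for the advice

Tags: r nlp gsub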


【Solution 1】:

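First, it is important to use the stringsAsFactors = FALSE option, either at the program level or during data import. This is because R (before version 4.0.0) converts strings to factors by default unless you tell it otherwise. Factors are useful in modelling, but you want to analyse the text itself, so you should make sure your text is not coerced to factors.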
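To illustrate why this matters here, the following is a minimal sketch that builds the same column both ways. The names as_factor and as_text are made up for this demonstration. It shows that strsplit() refuses factor input, which is one reason string code can fail on a data frame created with the old default.

```r
# Hypothetical demo data frames, built both ways for comparison
as_factor <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = TRUE)
as_text   <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)

class(as_factor$V3)  # "factor"
class(as_text$V3)    # "character"

# strsplit() only accepts character vectors, so the factor column raises an error;
# tryCatch() captures the error message instead of stopping the script
split_factor <- tryCatch(strsplit(as_factor$V3, " "),
                         error = function(e) conditionMessage(e))

# The character column splits as expected
split_text <- strsplit(as_text$V3, " ")
```

If the data is already loaded with factor columns, wrapping the column in as.character() before any string processing has the same effect.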

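My approach to this problem is to write a function that "explodes" each string into a vector of words and then uses match to replace the words. The vector is then reassembled into a single string.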

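I am not sure how well this will perform on your 600K records. You might look at some of the R packages that deal with strings, such as stringr or stringi, as they may have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the strings and other factors.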
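If per-row function calls turn out to be slow on a large dataset, one possible alternative in base R is to split every string once, run a single match() over all the tokens, and then regroup them. This is only a sketch under the same assumption of splitting on single spaces; the function name replace_synonyms_vec is made up for illustration, and it has not been benchmarked against the approach in this answer.

```r
questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace whole space-separated tokens found in `from`
# with the value at the same position in `to`
replace_synonyms_vec <- function(x, from, to) {
  tokens <- strsplit(x, " ", fixed = TRUE)      # one token vector per input string
  flat   <- unlist(tokens, use.names = FALSE)   # all tokens in a single vector
  idx    <- match(flat, from)                   # position in `from`, NA if absent
  hit    <- !is.na(idx)
  flat[hit] <- to[idx[hit]]                     # swap in the replacement words

  # Remember which input string each token came from, then rebuild the sentences
  groups <- factor(rep(seq_along(tokens), lengths(tokens)),
                   levels = seq_along(tokens))
  unname(vapply(split(flat, groups), paste, character(1), collapse = " "))
}

questions$V5 <- replace_synonyms_vec(questions$V3, voc$synonyms, voc$word)
questions
```

The design choice is to call match() once on the flattened token vector rather than once per row, which keeps most of the work inside vectorized functions.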

# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)  

voc <- cbind(word = c("weather","a","blue"),
             synonyms = c("day","the","brown"))
voc <- data.frame(voc)

# This function takes:
#  - an input string
#  - a vector of words to replace
#  - a vector of the words to use as replacements
# It returns a list of the original input and the changed version    
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) {

    # Start by breaking the input string into a vector
    # Note that we use [[1]] to get first list element of strsplit output
    # Obviously this relies on breaking sentences by spacing
    orig_words <- strsplit(x = input_string,split = " ")[[1]]

    # If we find at least one of the words to replace in the original words, proceed
    if(sum(orig_words %in% words_to_repl) > 0) {

        # The left side selects the elements of orig_words that match words to be replaced
        # The right side uses match to find the numeric index of those words within the words_to_repl vector
        # (nomatch = 0 means unmatched words contribute no element to the selection)
        # This numeric vector is used to select the values from repl_words
        # These then replace the values in orig_words
        orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)]

        # We rebuild the sentence again, and return a list with original and new version
        new_sent <- paste(orig_words,collapse = " ")
        return(list(original = input_string,new = new_sent))
    } else {

        # Otherwise we return the original version since no changes are needed
        return(list(original = input_string,new = input_string))
    }
}

# Using do.call and rbind.data.frame, we can collapse the output of a lapply()

do.call(what = rbind.data.frame,
        args = lapply(X = questions$V3,
                      FUN = uFunc_FindAndReplace,
                      words_to_repl = voc$synonyms,
                      repl_words = voc$word))

>
                original                      new
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
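As the comment in the function notes, splitting on spaces means a token such as "day?" would not match the synonym "day". A possible workaround, sketched below with base R only, is to locate runs of letters with gregexpr() and substitute them in place with `regmatches<-`, leaving punctuation untouched. The function name replace_words_keep_punct and the sample sentence are made up for illustration, and matching stays case-sensitive.

```r
# Hypothetical helper: replace alphabetic words found in `from` with the value
# at the same position in `to`, leaving punctuation and spacing as they are
replace_words_keep_punct <- function(x, from, to) {
  m <- gregexpr("[[:alpha:]]+", x)       # positions of every run of letters
  words <- regmatches(x, m)              # list: the words of each input string
  regmatches(x, m) <- lapply(words, function(w) {
    idx <- match(w, from)                # position in `from`, NA if absent
    hit <- !is.na(idx)
    w[hit] <- to[idx[hit]]
    w                                    # same length as the matches it replaces
  })
  x
}

replace_words_keep_punct("Is it the day?",
                         from = c("day", "the", "brown"),
                         to   = c("weather", "a", "blue"))
# "Is it a weather?"
```

Whether letters-only tokens are the right definition of a word depends on the data; apostrophes, digits and hyphens would need a different pattern.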

【Comments】:

  • Nice work! Thank you very much :) It works correctly on my large dataset