【Question title】: R - Extract the matched string, split into multiple columns which are matched by a dictionary vector
【Posted】: 2018-04-30 13:58:55
【Question description】:

I want to extract the strings in the favor column of the target data that match entries in dictionary. Here is my data:

dictionary <- c("apple", "banana", "orange", "grape")

target <- data.frame("user" = c("A", "B", "C"),
                     "favor" = c("I like apple and banana", "grape and kiwi", "orange, banana and grape are the best"))
target
  user                                 favor
1    A               I like apple and banana
2    B                        grape and kiwi
3    C orange, banana and grape are the best

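Below is my expected result, result. I want the columns to be created automatically, based on the largest number of dictionary matches found in any row (3 in my example), with each column holding one of the matched strings.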

result <- data.frame("user" = c("A", "B", "C"), 
                     "favor_1" = c("apple", "grape", "orange"), 
                     "favor_2" = c("banana", "", "banana"), 
                     "favor_3" = c("", "", "grape"))
result

  user favor_1 favor_2 favor_3
1    A   apple  banana        
2    B   grape                
3    C  orange  banana   grape

Any help would be appreciated.

【Question comments】:

Tags: r string


【Solution 1】:
# Remove all words from `target$favor` that are not in the dictionary
result <- lapply(strsplit(target$favor, ',| '), function(x) { x[x %in% dictionary] })
result
# [[1]]
# [1] "apple"  "banana"
# 
# [[2]]
# [1] "grape" 
# 
# [[3]]
# [1] "orange" "banana" "grape" 

# Fill in NAs when the rows have different numbers of items
result <- lapply(result, `length<-`, max(lengths(result)))

# Rebuild the data.frame using the list of words in each row
cbind(target[ , 'user', drop = FALSE], do.call(rbind, result))
#   user      1      2     3
# 1    A  apple banana  <NA>
# 2    B  grape   <NA>  <NA>
# 3    C orange banana grape

Note that I created target with stringsAsFactors = FALSE so that strsplit works (strsplit needs a character vector, not a factor).
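The steps above return NA padding and columns named 1, 2, 3, while the question asks for empty strings and favor_1 ... favor_n. A minimal sketch of one way to close that gap, assuming the same target and dictionary as in the question (the helper names words, n_max and padded are illustrative, not from the original answer):

```r
dictionary <- c("apple", "banana", "orange", "grape")
target <- data.frame(
  user  = c("A", "B", "C"),
  favor = c("I like apple and banana", "grape and kiwi",
            "orange, banana and grape are the best"),
  stringsAsFactors = FALSE  # strsplit() needs character input, not factors
)

# Split on runs of commas/spaces and keep only the dictionary words
words <- lapply(strsplit(target$favor, "[, ]+"),
                function(x) x[x %in% dictionary])

# Pad every row to the longest match count; the padding value is NA
n_max  <- max(lengths(words))
padded <- do.call(rbind, lapply(words, `length<-`, n_max))

# Replace the NA padding with "" and name the columns favor_1 ... favor_n
padded[is.na(padded)] <- ""
colnames(padded) <- paste0("favor_", seq_len(n_max))

result <- cbind(target["user"],
                as.data.frame(padded, stringsAsFactors = FALSE))
result
```

Because the number of columns comes from max(lengths(words)), it adapts to the data instead of being hard-coded to 3.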

【Comments】:

  • Alternatively, the first lapply could be something like: lapply(target$favor, function(x) regmatches(x, gregexpr(paste(dictionary, collapse = "|"), x))), so you don't have to remove strings.
  • Also, you can use data.frame in the assignment if you like, e.g.: target[, c("favor_1", "favor_2", "favor_3")] <- data.frame(do.call(rbind, result)). Not sure whether it is any more efficient, but it is an option. Nice answer overall!
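The regmatches idea from the first comment can be sketched as follows. This is a variation rather than the commenter's exact code: it calls regmatches once on the whole vector (which returns one character vector per row), and it adds \\b word boundaries with perl = TRUE, an assumption made here so that "grape" does not match inside a longer word.

```r
dictionary <- c("apple", "banana", "orange", "grape")
favor <- c("I like apple and banana", "grape and kiwi",
           "orange, banana and grape are the best")

# Build one alternation pattern; \\b keeps matches on whole words only
pattern <- paste0("\\b(", paste(dictionary, collapse = "|"), ")\\b")

# gregexpr() locates every match per string; regmatches() extracts the text
matches <- regmatches(favor, gregexpr(pattern, favor, perl = TRUE))
matches
```

The resulting list has the same shape as the output of the first lapply step in the answer, so the padding and cbind steps can be applied to it unchanged. The sketch assumes the dictionary entries contain no regex metacharacters.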
【Solution 2】:

Your best option is probably to apply str_extract_all to each row.

library(stringr)
result <- t(apply(target, 1,
                  function(x) str_extract_all(x[['favor']], dictionary, simplify = TRUE)))

     [,1]    [,2]     [,3]     [,4]   
[1,] "apple" "banana" ""       ""     
[2,] ""      ""       ""       "grape"
[3,] ""      "banana" "orange" "grape"
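In the output above each dictionary entry gets its own fixed column, so matches are not packed to the left as the question's expected result requires. A hedged sketch of one way to get the left-packed layout with stringr, assuming the dictionary entries contain no regex metacharacters: collapse the dictionary into a single alternation pattern, so that simplify = TRUE pads each row with "" on the right. The names pattern and m are illustrative.

```r
library(stringr)

dictionary <- c("apple", "banana", "orange", "grape")
target <- data.frame(
  user  = c("A", "B", "C"),
  favor = c("I like apple and banana", "grape and kiwi",
            "orange, banana and grape are the best"),
  stringsAsFactors = FALSE
)

# One regex for the whole dictionary: "apple|banana|orange|grape"
pattern <- str_c(dictionary, collapse = "|")

# simplify = TRUE returns a character matrix, one row per input string,
# padded with "" where a row has fewer matches than the widest row
m <- str_extract_all(target$favor, pattern, simplify = TRUE)
colnames(m) <- paste0("favor_", seq_len(ncol(m)))

result <- data.frame(user = target$user, m, stringsAsFactors = FALSE)
result
```

Matches appear in the order they occur in each string, not in dictionary order.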

【Comments】:
