[Question title]: Matching strings loop over multiple columns
[Posted]: 2020-04-21 22:44:02
[Question]:

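I have data from an open-ended survey. I have a comments table and a codes table. The codes table is a set of themes or strings.

What I am trying to do: check whether a word/string from the relevant column of the codes table appears in the open-ended comment. Add a new column to the comments table for each theme, holding a binary 1 or 0 to indicate which records have been tagged.

The codes table has quite a few columns, and these are live and ever-changing: the column order and the number of columns can change.

I am currently doing this in a rather convoluted way, checking each column individually with multiple lines of code, and I think there is probably a better way to do it.

I can't figure out how to get lapply to work with the stringi functions.

Help is greatly appreciated.

Here is a set of sample code so you can see what I am trying to do: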

library(stringi)  # provides stri_detect_regex()

#Two tables: codes and comments
#codes table
codes <- structure(
  list(
    Support = structure(
      c(2L, 3L, NA),
      .Label = c("",
                 "help", "questions"),
      class = "factor"
    ),
    Online = structure(
      c(1L,
        3L, 2L),
      .Label = c("activities", "discussion board", "quiz"),
      class = "factor"
    ),
    Resources = structure(
      c(3L, 2L, NA),
      .Label = c("", "pdf",
                 "textbook"),
      class = "factor"
    )
  ),
  row.names = c(NA,-3L),
  class = "data.frame"
)
#comments table
comments <- structure(
  list(
    SurveyID = structure(
      1:5,
      .Label = c("ID_1", "ID_2",
                 "ID_3", "ID_4", "ID_5"),
      class = "factor"
    ),
    Open_comments = structure(
      c(2L,
        4L, 3L, 5L, 1L),
      .Label = c(
        "I could never get the pdf to download",
        "I didn’t get the help I needed on time",
        "my questions went unanswered",
        "staying motivated to get through the textbook",
        "there wasn’t enough engagement in the discussion board"
      ),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA,-5L)
)

#check if any words from the columns in the codes table match the comments

#here I am looking for a match column by column, but I am looking for a better way - lapply?

support = paste(codes$Support, collapse = "|")
supp_stringi = stri_detect_regex(comments$Open_comments, support)
supp_grepl = grepl(pattern = support, x = comments$Open_comments)
identical(supp_stringi, supp_grepl)
comments$Support = ifelse(supp_grepl == TRUE, 1, 0)

# What I would like to do is loop through all columns in codes rather than outlining the above code for each column in codes

[Comments]:

  • Can you show the expected output for the input?

Tags: r lapply stringi


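[Solution 1]:

Here is an approach that uses stringi::stri_detect_regex() with lapply() to create vectors of TRUE = 1, FALSE = 0, depending on whether any of the words in the Support, Online, or Resources vectors appear in the comments, and then merges this data back with the comments.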

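# build data structures from OP (the codes and comments data frames above)
library(stringi)

resultsList <- lapply(seq_along(codes), function(x) {
     # drop NA entries so they are not pasted into the pattern as the literal "NA"
     pattern <- paste(na.omit(codes[[x]]), collapse = "|")
     y <- stri_detect_regex(comments$Open_comments, pattern)
     ifelse(y == TRUE, 1, 0)   # TRUE -> 1, FALSE -> 0
     })

results <- as.data.frame(do.call(cbind, resultsList))
colnames(results) <- colnames(codes)
mergedData <- cbind(comments, results)
mergedData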

...and the results.

> mergedData
  SurveyID                                          Open_comments Support Online Resources
1     ID_1                 I didn’t get the help I needed on time       1      0         0
2     ID_2          staying motivated to get through the textbook       0      0         1
3     ID_3                           my questions went unanswered       1      0         0
4     ID_4 there wasn’t enough engagement in the discussion board       0      1         0
5     ID_5                  I could never get the pdf to download       0      0         1
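One edge case worth guarding against when a codes column is collapsed into a pattern: paste() renders NA as the literal text "NA", and an empty string leaves an empty alternative in the pattern. The following is a minimal sketch in base R, not part of the original answer; build_pattern, toy_codes, and toy_comments are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not part of the original answers): collapse one
# codes column into a regex alternation, skipping NA and "" entries.
build_pattern <- function(x) {
  terms <- as.character(x)                       # also handles factor columns
  terms <- terms[!is.na(terms) & nzchar(terms)]  # drop NA and empty strings
  paste(terms, collapse = "|")
}

# Toy data, assumed for illustration only
toy_codes <- data.frame(
  Support = c("help", "questions", NA),
  Online  = c("quiz", "", NA),
  stringsAsFactors = FALSE
)
toy_comments <- c("I need help", "nothing to report")

# Without the guard, NA ends up in the pattern as the text "NA"
naive_pattern <- paste(toy_codes$Support, collapse = "|")
safe_pattern  <- build_pattern(toy_codes$Support)

# One 0/1 integer vector per codes column
flags <- lapply(toy_codes, function(col) {
  as.integer(grepl(build_pattern(col), toy_comments))
})
```

Note that if a column contains only NA or empty entries, this helper returns an empty pattern, which matches every comment; whether to skip such columns is a choice the sketch leaves open.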


    [Solution 2]:

    A one-liner using base R:

    comments[names(codes)] <- lapply(codes, function(x) 
                +(grepl(paste0(na.omit(x), collapse = "|"), comments$Open_comments)))
    comments
    
    #  SurveyID                                          Open_comments Support Online Resources
    #1     ID_1                 I didn’t get the help I needed on time       1      0         0
    #2     ID_2          staying motivated to get through the textbook       0      0         1
    #3     ID_3                           my questions went unanswered       1      0         0
    #4     ID_4 there wasn’t enough engagement in the discussion board       0      1         0
    #5     ID_5                  I could never get the pdf to download       0      0         1
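Because lapply() over a data frame iterates its columns and keeps their names, the one-liner needs no change when columns are added to or reordered in codes. The sketch below wraps it in a function to show this on toy data; tag_comments, text_col, and the toy tables are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical wrapper around the one-liner; text_col names the comment column.
tag_comments <- function(comments, codes, text_col) {
  comments[names(codes)] <- lapply(codes, function(x) {
    # unary + converts the logical result of grepl() to integer 1/0
    +grepl(paste0(na.omit(x), collapse = "|"), comments[[text_col]])
  })
  comments
}

# Toy data, assumed for illustration only
toy_comments <- data.frame(
  text = c("the quiz was hard", "no pdf available"),
  stringsAsFactors = FALSE
)
toy_codes <- data.frame(
  Online    = c("quiz", "activities"),
  Resources = c("pdf", "textbook"),
  stringsAsFactors = FALSE
)

tagged <- tag_comments(toy_comments, toy_codes, "text")

# A theme added to codes later is picked up without touching the function
toy_codes$Support <- c("help", NA)
tagged2 <- tag_comments(toy_comments, toy_codes, "text")
```

Here tagged gains one integer column per theme, and tagged2 additionally gains a Support column once that theme exists in toy_codes.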
    
