Subset dataframe based on list objects: answers

【Question title】: Subset dataframe based on list objects
【Posted】: 2021-04-25 04:50:07
【Question】:

I have a dataframe with speech data in the column Turn:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
          "not now because it's too late!"), stringsAsFactors = F)

I want to subset the dataframe to those rows in which cos and/or because is preceded by at least four words. To do this, I computed the indices of cos and because in Turn:

test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6
4           so you can't, cos well, ah because you know why!  4, 7
5                             not now because it's too late!     3

One row has more than one index, which is why my attempt to subset like this fails:

test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) : 
  (list) object cannot be coerced to type 'double'

How can I subset test by ignoring the second listed Index value?

Expected result:

test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

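I'd appreciate any answers, including ones that skip the detour via indices and use a regex pattern in the subsetting step instead.

EDIT:

A solution in the sapply paradigm is quite simple: just select the first value of each list element: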

sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3
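
To make the idea in the edit concrete, here is a minimal base R sketch that goes from the list column to the expected subset. The names first_idx and result are illustrative and not from the original post; vapply is swapped in for sapply only so that the result is guaranteed to be an integer vector.

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"), stringsAsFactors = FALSE)

# Row 4 contains two hits, so sapply() cannot simplify and returns a list
test$Index <- sapply(strsplit(test$Turn, " "),
                     function(x) which(x == "cos" | x == "because"))

# Keep only the first hit per row (NA if a row had no hit at all)
first_idx <- vapply(test$Index, function(x) x[1], integer(1))

# which() drops NA comparisons, so rows without a hit are simply excluded
result <- test[which(first_idx >= 5), ]
result
```

Because the comparison now runs on a plain integer vector instead of a list, the coercion error from the original attempt no longer occurs.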

【Question comments】:

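It's unclear what you consider a word. Why are only these three strings in the expected output?
Why doesn't row 4 match? Aren't there 6 words before because?
Row 4 doesn't match because cos is already in position 4.
@WiktorStribiżew I count anything delimited by whitespace as a word, i.e. Hi. counts as one word, just as she'd've counts as one word.
Do I understand correctly that a row should not match if cos or because is the first, second, third, or fourth word in the string?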

Tags: r regex subset

【Solution 1】:

I hope this gives you an idea:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
          "not now because it's too late!"), stringsAsFactors = F)
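# Reject rows where cos/because is among the first four words; otherwise require four preceding words
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# test has a single column here, so [rows, ] drops to a character vector of the matching turns
Turn <- test[grepl(rx, test$Turn, perl=TRUE),]
# Split each matching turn at cos/because; the first piece is the text before the first keyword
split <- strsplit(Turn, "\\b(cos|because)\\b")
# Number of words before the keyword, plus one, is the keyword's position
Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1)
test <- data.frame(Turn, Index, stringsAsFactors = F)
test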

Output:

                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

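See the R demo and the main regex demo.

Regex details:

^\s*(?:\S+\s+){0,3}(?:cos|because)\b.*(*SKIP)(*F) - matches the start of the string, then zero to three words, then cos or because as a whole word, then the rest of the string; (*SKIP)(*F) then fails the match and skips past the consumed text
| - or
(?:\S+[\s,]+){4}\b(cos|because)\b - matches cos or because as a whole word preceded by four words, each followed by whitespace and/or commas.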
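
For readers who would rather avoid the backtracking control verbs, the same filter can be sketched in base R by cutting each turn at its first keyword and counting the words that remain. This is an alternative technique, not the answer's method; the names pattern, has_marker, before, n_before and result are illustrative. Unlike the question's exact comparison on space-split tokens, the \b-based pattern would also match a keyword followed by punctuation, such as "cos,".

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"), stringsAsFactors = FALSE)

pattern <- "\\b(cos|because)\\b"

# Rows that contain a keyword at all (sub() leaves other rows unchanged)
has_marker <- grepl(pattern, test$Turn, perl = TRUE)

# Drop everything from the first keyword to the end of the string
before <- sub(paste0(pattern, ".*$"), "", test$Turn, perl = TRUE)

# Count whitespace-delimited words in front of the first keyword
n_before <- lengths(strsplit(trimws(before), "\\s+"))

# Keep rows whose first keyword has at least four words in front of it
result <- test[has_marker & n_before >= 4, , drop = FALSE]
result
```

The drop = FALSE argument keeps result a data frame even though test has only one column.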

【Comments】:

The solution is really just this: Turn <- test[grepl(rx, test$Turn, perl=TRUE),]. Why do I need the remaining code (split <- strsplit(Turn, "\\b(cos|because)\\b") Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1) test <- data.frame(Turn, Index, stringsAsFactors = F))?
@ChrisRuehlemann Well, you showed the expected output, so I tried to follow it.

【Solution 2】:

A tidyverse-based solution is shown below.

library(dplyr)
library(purrr)
library(stringr)

test %>%
  mutate(index = map(str_split(Turn, ' '), 
                     ~ str_which(., 'cos|because')[1])) %>%
  filter(index >= 5)

#                                                         Turn index
# 1                             Hi. I'm you an' you are me cos     8
# 2    she'd've been so happy cos with all this stuff goin' on     5
# 3 but we're in this together, because y' know things happens     6
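
One possible tightening of the pipeline above, offered as a sketch rather than as the answer's code: map_int() returns a plain integer column instead of a list column, and anchoring the pattern with ^ and $ avoids matching cos or because inside longer tokens (a hypothetical token such as "cost" would otherwise count as a hit). The name result is illustrative. As with the question's exact comparison, a token with attached punctuation such as "cos," is not matched by the anchored pattern.

```r
library(dplyr)
library(purrr)
library(stringr)

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"), stringsAsFactors = FALSE)

result <- test %>%
  # Position of the first token that is exactly "cos" or "because" (NA if none)
  mutate(index = map_int(str_split(Turn, " "),
                         ~ str_which(.x, "^(cos|because)$")[1])) %>%
  # filter() drops rows where the condition is FALSE or NA
  filter(index >= 5)
result
```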

【Comments】:

I get this error: Error: Problem with 'mutate()' input 'Index'. x could not find function "map" ℹ Input 'Index' is map(str_split(Turn, " "), ~str_which(., "cos|because")[1]). Run 'rlang::last_error()' to see where the error occurred.
@ChrisRuehlemann: The error message makes sense, because library(purrr) was missing from my post. It has now been added.