【Posted】: 2021-04-25 04:50:07
【Problem description】:
I have a data frame with speech data in the column Turn:
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)
I want to subset the data frame to those rows in which at least four words precede cos and/or because. To that end, I computed the positions of cos and because in Turn:
test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6
4           so you can't, cos well, ah because you know why!  4, 7
5                             not now because it's too late!     3
One row contains more than one index, which is why my attempt at subsetting like this fails:
test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) :
(list) object cannot be coerced to type 'double'
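The error seems to arise because sapply cannot simplify results of unequal length, so Index ends up as a list column rather than a numeric vector. A minimal check, assuming the test data frame defined above (rebuilt here so the snippet runs on its own):

```r
# Rebuild the example data from the question
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# Same index computation as in the question
test$Index <- sapply(strsplit(test$Turn, " "),
                     function(x) which(x == "cos" | x == "because"))

# Row 4 yields two positions, so sapply returns a list, not a vector
is.list(test$Index)   # TRUE
lengths(test$Index)   # 1 1 1 2 1
```

Comparing a list with a number via >= is what triggers the coercion error.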
How can I subset test while ignoring the second listed Index value?
Expected result:
test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6
I would appreciate any answers, including ones that skip the detour via indices and instead use a regex pattern in the subsetting step.
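One possible regex-only route, offered as a sketch rather than a definitive answer: require at least four whitespace-delimited words, none of which is itself cos or because, before the first keyword. The pattern and the helper name keep are assumptions made for illustration; the pattern relies on PCRE lookaheads, hence perl = TRUE.

```r
# Rebuild the example data from the question
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# From the start of the string: four or more words that are not a
# standalone "cos"/"because", each followed by whitespace, and then a
# standalone "cos" or "because" (followed by whitespace or end of string).
pattern <- "^(?:(?!(?:cos|because)(?:\\s|$))\\S+\\s+){4,}(?:cos|because)(?=\\s|$)"

keep <- grepl(pattern, test$Turn, perl = TRUE)
test[keep, , drop = FALSE]
```

Because the negative lookahead prevents any earlier word from being a keyword, row 4 is rejected even though six words precede because. Note that the pattern treats runs of whitespace as a single separator, whereas strsplit(x, " ") would produce empty tokens for doubled spaces.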
EDIT:
A solution within the sapply paradigm turns out to be quite simple: just select the first value of each list element:
sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3
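Building on that idea, the whole subsetting step might look like the following sketch. The name first_idx is hypothetical, and vapply is swapped in for sapply only so that the result is guaranteed to be an integer vector.

```r
# Rebuild the example data from the question
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# Position of the first "cos"/"because" in each row
# (NA if a row contains neither keyword)
first_idx <- vapply(strsplit(test$Turn, " "),
                    function(x) which(x %in% c("cos", "because"))[1],
                    integer(1))

# Keep rows whose first keyword is at position 5 or later,
# i.e. at least four words come before it
result <- test[!is.na(first_idx) & first_idx >= 5, , drop = FALSE]
result
```

The !is.na() guard only matters for rows without any keyword; without it, such rows would show up as NA rows in the result.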
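【Discussion】:
- It is not clear what you consider a word. Why is the expected output only those three strings?
- Why doesn't row 4 match? There are 6 words before because?
- Row 4 doesn't match because cos is already in 4th position.
- @WiktorStribiżew I consider anything separated by whitespace a word, i.e. Hi. counts as one word, just as she'd've counts as one word.
- So the rows to exclude are those where cos or because occurs as the first, second, third, or fourth word in the string, do I understand correctly?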