【Posted】: 2020-03-05 16:50:35
【Question】:
I have a dataset of sentences that contain consecutive word repetitions:
Data:
df <- data.frame(
  Turn = c("oh is that that steak i got the other night",     # that that
           "no no no i 'm dave and you 're alan",             # no no no
           "yeah i mean the the film was quite long though",  # the the
           "it had steve martin in it it 's a comedy"))       # it it
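Goal:
What I would like to obtain are three more columns added to this data frame:
- df$rep_Word: a column specifying the repeated word
- df$rep_Pos: a column specifying the position in the sentence of the first repeated occurrence of that word
- df$rep_Numb: a column specifying how many times the word is repeated
So the expected data frame would look like this:
Expected result: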
df
                                            Turn rep_Word rep_Pos rep_Numb
1    oh is that that steak i got the other night     that       4        1
2            no no no i 'm dave and you 're alan       no       2        2
3 yeah i mean the the film was quite long though      the       5        1
4       it had steve martin in it it 's a comedy       it       7        1
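Attempts so far:
My hunch is that strsplit and the function duplicated could be used to get the repeated word, its position, and the number of repetitions, for example like this: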
df_split <- apply(df, 2, function(x) strsplit(x, "\\s"))
df_split
$Turn
$Turn[[1]]
[1] "oh" "is" "that" "that" "steak" "i" "got" "the" "other" "night"
$Turn[[2]]
[1] "no" "no" "no" "i" "'m" "dave" "and" "you" "'re" "alan"
$Turn[[3]]
[1] "yeah" "i" "mean" "the" "the" "film" "was" "quite" "long" "though"
$Turn[[4]]
[1] "it" "had" "steve" "martin" "in" "it" "it" "'s" "a" "comedy"
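For example, for the first sentence in df, duplicated shows which word is repeated (i.e. the word for which duplicated evaluates to TRUE), and the number and position of the repetitions can also be read off that output: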
duplicated(df_split$Turn[[1]])
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
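To make this concrete for all four sentences, here is a minimal base R sketch that reads the flagged positions and words off duplicated with which. The names turns, words, dup_pos and dup_words are introduced here purely for illustration. It also exposes a limitation: duplicated flags a word whenever it has occurred anywhere earlier in the sentence, not only directly before, so in the fourth sentence position 6 is flagged as well as position 7.

```r
# Rebuild the example sentences as a plain character vector (illustrative name)
turns <- c("oh is that that steak i got the other night",
           "no no no i 'm dave and you 're alan",
           "yeah i mean the the film was quite long though",
           "it had steve martin in it it 's a comedy")

# Split each sentence into words on whitespace
words <- strsplit(turns, "\\s")

# Positions at which duplicated() evaluates to TRUE, per sentence
dup_pos <- lapply(words, function(w) which(duplicated(w)))

# The words found at those positions
dup_words <- Map(function(w, p) w[p], words, dup_pos)

dup_pos[[1]]    # 4
dup_pos[[2]]    # 2 3
dup_pos[[4]]    # 6 7  (6 is flagged because "it" is also the first word)
dup_words[[4]]  # "it" "it"
```

So the raw output of duplicated matches the desired rep_Pos for the first three sentences, but not for the fourth, where the word also occurs non-consecutively.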
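The problem is that I don't know how to operationalize duplicated so as to obtain the desired additional columns in df. Help with this is much appreciated.
【Comments】:
Tags: r duplicates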