【Posted】: 2020-06-15 16:11:05
【Question】:
I have a data frame in which each row of the "leg_activity" column is a comma-separated string:
structure(list(id = c("100", "100060", "100073", "100098", "100102",
"100104", "100125", "100128", "100149", "100217", "100220", "100271",
"100464", "100465", "100520", "100607", "100653", "100745", "100757",
"100760"), leg_activity = c("home", "home, car, work, car, leisure, car, other, car, leisure, car, work, car, shop, car, home",
"home, walk, leisure, walk, leisure, walk, home", "home, car, other, car, shop, car, other, car, home",
"home, car, work, car, home, car, home", "home", "home, walk, education, walk, home",
"home, car, other, car, work, car, shop, car, shop, car, home",
"home, car, shop, car, work, car, home", "home, bike, leisure, bike, home",
"home, walk, shop, walk, home", "home, pt, leisure, car, leisure, pt, home",
"home, car, education, car, home", "home, car, leisure, car, home",
"home, walk, home, walk, shop, walk, home", "home, pt, work, walk, leisure, walk, work, pt, home",
"home, pt, leisure, walk, leisure, walk, home", "home, walk, home, bike, shop, bike, home",
"home, pt, work, pt, home, walk, work, walk, home", "home")), row.names = c(2L,
15L, 20L, 24L, 31L, 33L, 40L, 43L, 48L, 70L, 73L, 93L, 147L,
148L, 156L, 174L, 188L, 213L, 214L, 220L), class = "data.frame")
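In each string, I want to extract the word that appears immediately before the word work. work can occur several times, and each time the preceding word needs to be extracted or counted.
Ultimately, I am interested in counting how often each word appears before work across the whole df.
What I have tried: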
library(dplyr)
library(stringr)
library(tidyr)  # separate_rows() and pivot_wider() come from tidyr
df %>%
  separate_rows(leg_activity, sep = "work, ") %>%
  group_by(id) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, values_from = leg_activity)
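Obviously, this does not give the desired result; it only splits the df into columns. So perhaps a different approach is more appropriate.
Thank you very much for your help!
【Comments】: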
-
Does this answer your question? Regex to return the word before the match
-
@Limey The question does seem to be the same, yes. But I don't know C#. I'm looking for a solution in R.
-
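My bad, I missed the c-sharp tag. Still, regex is the way to go. Take a look at the
stringr package. This looks more relevant. ;)
-
gsub('(\\w+,)(?=\\s*work)|.', '\\1', df$leg_activity, perl = TRUE) and then you can split the result into columns on the comma.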
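
Building on the regex suggestions above, here is a minimal sketch of one way to get the counts directly in base R. It assumes the entries are always separated by a comma plus optional whitespace, and it uses a small subset of the rows from the question so the example is self-contained. The names pattern, before_work and counts are made up for illustration.

```r
# Small subset of the question's data, for illustration only
df <- data.frame(
  id = c("100", "100060", "100102", "100607", "100757"),
  leg_activity = c(
    "home",
    "home, car, work, car, leisure, car, other, car, leisure, car, work, car, shop, car, home",
    "home, car, work, car, home, car, home",
    "home, pt, work, walk, leisure, walk, work, pt, home",
    "home, pt, work, pt, home, walk, work, walk, home"
  ),
  stringsAsFactors = FALSE
)

# Match a word only when it is directly followed by ", work".
# The lookahead (?=...) requires perl = TRUE in base R.
pattern <- "\\w+(?=,\\s*work\\b)"

# One character vector per row; rows without "work" give character(0)
before_work <- regmatches(df$leg_activity,
                          gregexpr(pattern, df$leg_activity, perl = TRUE))

# Frequency of each preceding word across the whole data frame
counts <- table(unlist(before_work))
print(counts)
```

Because the lookahead does not consume the text it checks, every occurrence of work in a string is picked up, including several in the same row. With stringr, str_extract_all(df$leg_activity, pattern) should give the same per-row list.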