【Question title】: How to extract the word before a certain word in a string?
【Posted】: 2020-06-15 16:11:05
【Question】:

I have a data frame in which each row of the "leg_activity" column is a comma-separated string:

structure(list(id = c("100", "100060", "100073", "100098", "100102", 
"100104", "100125", "100128", "100149", "100217", "100220", "100271", 
"100464", "100465", "100520", "100607", "100653", "100745", "100757", 
"100760"), leg_activity = c("home", "home, car, work, car, leisure, car, other, car, leisure, car, work, car, shop, car, home", 
"home, walk, leisure, walk, leisure, walk, home", "home, car, other, car, shop, car, other, car, home", 
"home, car, work, car, home, car, home", "home", "home, walk, education, walk, home", 
"home, car, other, car, work, car, shop, car, shop, car, home", 
"home, car, shop, car, work, car, home", "home, bike, leisure, bike, home", 
"home, walk, shop, walk, home", "home, pt, leisure, car, leisure, pt, home", 
"home, car, education, car, home", "home, car, leisure, car, home", 
"home, walk, home, walk, shop, walk, home", "home, pt, work, walk, leisure, walk, work, pt, home", 
"home, pt, leisure, walk, leisure, walk, home", "home, walk, home, bike, shop, bike, home", 
"home, pt, work, pt, home, walk, work, walk, home", "home")), row.names = c(2L, 
15L, 20L, 24L, 31L, 33L, 40L, 43L, 48L, 70L, 73L, 93L, 147L, 
148L, 156L, 174L, 188L, 213L, 214L, 220L), class = "data.frame")

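In each string, I want to extract the word that appears before the word work. work can occur several times, and each time the preceding word needs to be extracted or counted.

In the end, I am interested in counting how often each word appears before work across the whole df.

What I have tried: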

library(dplyr)
library(tidyr)   # separate_rows() and pivot_wider() come from tidyr
library(stringr)

df%>%
  separate_rows(leg_activity, sep = "work, ") %>%
  group_by(id) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, values_from = leg_activity) 

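Obviously, this does not produce the result; it only splits the df into columns. So maybe another approach is more suitable.

Thank you very much for your help!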

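【Question comments】:

  • Does this answer your question? Regex to return the word before the match
  • @Limey The question does seem to be the same, yes. But I don't know C#. I am looking for a solution in R.
  • My bad, I missed the c-sharp tag. Still, regex is the way to go. Take a look at the stringr package. This looks more relevant. ;)
  • gsub('(\\w+,)(?=\\s*work)|.', '\\1', df$leg_activity, perl = TRUE) and then you can split on , into columns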
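The gsub suggestion in the last comment can be followed through to a frequency table. The sketch below is only an illustration: the vector x is made-up toy input in the same format as leg_activity, and the names kept, words and word_counts are hypothetical.

```r
# Toy input in the same format as leg_activity (made up for illustration)
x <- c("home",
       "home, car, work, car, home",
       "home, pt, work, walk, work, pt, home")

# Keep "<word>," only where it is directly followed by "work";
# every other character matches "." and is replaced by the empty group \\1
kept <- gsub("(\\w+,)(?=\\s*work)|.", "\\1", x, perl = TRUE)

# Split the surviving words on the comma and drop any empty pieces
words <- unlist(strsplit(kept, ","))
words <- words[nzchar(words)]

# Frequency of each word that precedes "work"
word_counts <- table(words)
print(word_counts)
```

Strings without any "work" collapse to an empty string, which is why the empty pieces are filtered out before counting.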

Tags: r string dplyr stringr


【Solution 1】:

First, a slightly smaller data set, to make it easier to follow the results of the code:

d = data.frame(id = 1:3, leg = c("home",
                                 "work, R, eat, work",
                                 "eat, work, R, work"), stringsAsFactors = FALSE) 

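Split the strings on ", " (strsplit). Loop over the resulting list (lapply). Get the indices of "work" (which(x == "work")), then the index just before each (- 1). If "work" is the first word, that previous index is 0; pmax(0, ...) keeps it from going below 0, and indexing with 0 returns an empty vector. Index the words (x[<the-index>]). Unlist and count the items (table(unlist(...))).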

table(unlist(lapply(strsplit(d$leg, ", "), function(x) x[pmax(0, which(x == "work") - 1)])))
# eat   R 
#   2   1 

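Given that "in the end, I am interested in counting how often each word appears before work across the whole df", grouping does not seem to be necessary.

【Comments】: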
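The steps described in this answer can also be unrolled into a small function, which makes each intermediate result easier to inspect. This is a sketch only: the helper name words_before and its target argument are hypothetical and not part of the answer, and it filters out index 0 explicitly instead of relying on pmax.

```r
# Same small data set as in the answer
d <- data.frame(id = 1:3,
                leg = c("home",
                        "work, R, eat, work",
                        "eat, work, R, work"),
                stringsAsFactors = FALSE)

# Hypothetical helper: the words directly preceding `target`
# in one comma-separated string
words_before <- function(s, target = "work") {
  x <- strsplit(s, ", ", fixed = TRUE)[[1]]  # split into words
  i <- which(x == target) - 1                # positions just before each target
  x[i[i > 0]]                                # drop position 0 (target was first)
}

# One character vector per row, named by id
per_row <- lapply(d$leg, words_before)
names(per_row) <- d$id

# Overall frequency across all rows
total <- table(unlist(per_row))
print(total)
```

Keeping the per-row list around is optional; it is only useful if the counts are later needed per id as well as overall.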

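    【Solution 2】:

    You can use separate_rows, splitting on the comma only, to put each word on its own row. Then, after grouping by id, you can filter for the rows whose next row (lead) is "work".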

    library(dplyr)
    library(tidyr)   # separate_rows() comes from tidyr
    
    df %>%
      separate_rows(leg_activity, sep = ",") %>%
      mutate(leg_activity = trimws(leg_activity)) %>%
      group_by(id) %>%
      filter(lead(leg_activity) == "work") %>%
      summarise(count = n())
    

    Output

    # A tibble: 6 x 2
      id     count
      <chr>  <int>
    1 100060     2
    2 100102     1
    3 100128     1
    4 100149     1
    5 100607     2
    6 100757     2
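The output of this answer is a count per id. Since the question asks how often each word precedes work, the same pipeline could end by counting the filtered words instead. The following is a sketch under assumptions: df_small is made-up toy data with the same two columns, and word_counts is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Toy data with the same columns as df (made up for illustration)
df_small <- data.frame(
  id = c("a", "b", "c", "d"),
  leg_activity = c("home",
                   "home, car, work, car, home",
                   "home, pt, work, walk, work, pt, home",
                   "home, car, work, car, home"),
  stringsAsFactors = FALSE
)

word_counts <- df_small %>%
  separate_rows(leg_activity, sep = ",\\s*") %>%  # one word per row
  group_by(id) %>%                                # lead() must not cross ids
  filter(lead(leg_activity) == "work") %>%        # keep words followed by "work"
  ungroup() %>%
  count(leg_activity, name = "count")             # frequency per word

print(word_counts)
```

Grouping by id is still kept for the filter step so that the last word of one id is never compared with the first word of the next.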
    

    【Comments】:

      【Solution 3】:
      library(stringr)
      WantedStrings <- sub(", work","",str_extract_all(df$leg_activity, "\\w+, work",simplify=T))
      WantedStrings <- WantedStrings[WantedStrings != ""]
      
      table(WantedStrings)
      
      
      WantedStrings
       car   pt walk 
         5    2    2
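A possible variant of this approach uses a lookahead so that only the preceding word is extracted and the separate sub() step is not needed. This is a sketch, not the answerer's code: x is made-up toy input, and the "workshop" entry is a hypothetical activity added only to show that the \\b word boundary prevents partial matches.

```r
library(stringr)

# Toy input (made up); "workshop" is a hypothetical activity
x <- c("home",
       "home, car, work, car, home",
       "home, bike, workshop, pt, work")

# Match a word only if it is followed by ", work" as a whole word
before_work <- unlist(str_extract_all(x, "\\w+(?=, work\\b)"))

print(table(before_work))
```

Because str_extract_all returns a list with one element per string, unlist is enough to pool the matches before tabulating.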
      

      【Comments】:

        【Solution 4】:

        Base R one-liner:

           # extract only the words that are directly followed by ", work"
           table(unlist(regmatches(df$leg_activity,
                                   gregexpr("\\w+(?=,\\s*work\\b)", df$leg_activity, perl = TRUE))))
        

        【Comments】:
