【Question title】: Filter rows with conditions
【Posted】: 2021-06-03 03:32:46
【Question】:

I have a data frame with different speakers and utterances, in which overlaps are marked by [...]; df also contains starttime_ms and endtime_ms values:

df <- data.frame(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]", 
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)

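I need to filter those rows where (i) the utterance consists entirely (from start to end) of a [...] expression and (ii) endtime_ms is smaller than the endtime_ms of the prior speaker.

The expected result is: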

# A tibble: 2 x 5
# Groups:   grp [1]
  speaker utterance starttime_ms endtime_ms   grp
  <chr>   <chr>            <dbl>      <dbl> <int>
1 B       [howdy]             25        125     2
2 B       [yeah]             444        555     2

I can filter on condition (i):

library(data.table)
library(dplyr)
df %>% 
  group_by(grp = rleid(speaker)) %>% 
  filter(grepl("^\\[[^][]+\\]$", utterance)) 

But I don't know how to implement condition (ii); adding & lag(endtime_ms) > endtime_ms as another condition in filter does not work.
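
One likely reason the lag() condition fails, assuming it was added inside the grouped filter(): after group_by(), lag() restarts at every group boundary, so the first row of each speaker run gets NA and every later row is compared with the same speaker's previous utterance rather than with the prior speaker. The column name prev_end_in_group is only illustrative.

```r
library(dplyr)

df <- data.frame(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]",
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)

# lag() is evaluated separately within each run of the same speaker
res_lag <- df %>%
  group_by(grp = data.table::rleid(speaker)) %>%
  mutate(prev_end_in_group = lag(endtime_ms)) %>%
  ungroup()

res_lag$prev_end_in_group
# NA NA 125 555 NA 1566 NA NA
```

The value needed for condition (ii) is the end time of the preceding run, which never appears in this column, so it has to be computed before grouping or carried across the group boundary.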

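【Question comments】:

  • "endtime_ms is smaller" seems off: 555 is not smaller than 125.
  • (It looks like you are using data.table::rleid; it would be best to include it.)
  • To clarify, does "endtime_ms smaller than the previous speaker's endtime_ms" refer to the grouped data before filtering by the first condition?
  • @LeonardoViotti Correct!

Tags: r dplyr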


【Solution 1】:

If you want to stick with the tidyverse:

library(tidyverse)

df <- data.frame(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]", 
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)

df %>%
  mutate(
    speaker2 = lag(speaker),
    endtime_ms2 = lag(endtime_ms)
  ) %>%
  mutate(endtime_ms2 = case_when(
    speaker == speaker2 ~ NA_real_,
    TRUE ~ endtime_ms2
  )) %>%
  fill(endtime_ms2) %>%
  filter(endtime_ms < endtime_ms2) %>%
  select(-contains("2"))
#>   speaker utterance starttime_ms endtime_ms
#> 1       B   [howdy]           25        125
#> 2       B    [yeah]          444        555

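【Discussion】:

  • But what if there is an utterance that is not [...] and still has an endtime_ms smaller than the previous speaker's endtime_ms?
  • I think your solution comes closest to what I have in mind. If you can address the issue raised in my previous comment, I'm happy to accept it. That is, how do I build in the constraint that only utterances matching the ^[...]$ pattern are selected?
  • I think I've got it: filter(grepl("^\\[[^][]+\\]$", utterance))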
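
Putting the last comment together with the answer, a minimal sketch might look like the following. It assumes the regex test is applied in the same filter() as the time comparison, after the previous speaker's end time has been computed on the unfiltered data; the helper columns prev_speaker and prev_end are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]",
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)

result <- df %>%
  mutate(
    prev_speaker = lag(speaker),
    # keep the previous row's end time only where the speaker changes
    prev_end = if_else(speaker == prev_speaker, NA_real_, lag(endtime_ms))
  ) %>%
  # carry the prior speaker's end time down through the rest of the run
  fill(prev_end) %>%
  filter(
    grepl("^\\[[^][]+\\]$", utterance),  # condition (i)
    endtime_ms < prev_end                # condition (ii)
  ) %>%
  select(-prev_speaker, -prev_end)

result
```

On the example data this keeps only the two B rows, [howdy] and [yeah].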
【Solution 2】:

Here is a data.table solution to your problem:

library( data.table )
setDT(df)  # convert df to a data.table by reference
# first keep only the rows whose utterance is a [...] string
ans <- df[ grepl("^\\[.*\\]$", utterance ), ]
# create a temporary column with the maximum endtime_ms of the previous group
ans[, temp := shift( ans[, max(endtime_ms), 
                         by = rleid(speaker)]$V1)[.GRP], 
    by = rleid(speaker)]
# now filtering is easy; drop the temp column afterwards
ans[ is.na(temp) | endtime_ms < temp, ][, temp := NULL][]
#    speaker utterance starttime_ms endtime_ms
# 1:       B   [howdy]           25        125
# 2:       B    [yeah]          444        555
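
Note that the pattern used here is looser than the one in the question. A small check on a made-up vector x (not taken from the data) shows the difference:

```r
# hypothetical test strings, for illustration only
x <- c("[howdy]", "[a] and [b]", "we're [good]", "[]")

loose  <- grepl("^\\[.*\\]$", x)      # anything between a leading [ and a trailing ]
strict <- grepl("^\\[[^][]+\\]$", x)  # exactly one non-empty [...] with no inner brackets

loose   # TRUE  TRUE FALSE  TRUE
strict  # TRUE FALSE FALSE FALSE
```

If an utterance can contain several bracketed parts, the stricter pattern from the question is probably the one that matches condition (i).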

【Discussion】:

  • Edit: forgot to group by rleid(speaker)... now fixed.
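
Since the asker confirmed that the prior speaker's endtime_ms should come from the data before the [...] filter is applied, a variant that computes it on the unfiltered table may be closer to the intent. This is only a sketch under two assumptions: "prior speaker" means the preceding run of consecutive rows by one speaker, and that run's end time is the endtime_ms of its last row (not the maximum used above). The names dt, runs, run, end_ms and prev_end are illustrative.

```r
library(data.table)

dt <- data.table(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]",
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)

# number the speaker runs on the full, unfiltered data
dt[, run := rleid(speaker)]

# one row per run: the end time of its last utterance
runs <- dt[, .(end_ms = endtime_ms[.N]), by = run]
# shift() lags by one, so each run sees the previous run's end time
runs[, prev_end := shift(end_ms)]

# attach prev_end to every row of the matching run (update join)
dt[runs, prev_end := i.prev_end, on = "run"]

# apply both conditions, then keep only the original columns
res <- dt[grepl("^\\[[^][]+\\]$", utterance) & endtime_ms < prev_end,
          .(speaker, utterance, starttime_ms, endtime_ms)]
res
```

Rows in the first run have no prior speaker, so their prev_end is NA and they are dropped rather than kept.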
【Solution 3】:

If I understand your question correctly, the code below should do it.

library(magrittr)
library(dplyr)
library(data.table)

df <- data.frame(
  speaker = c("A", "B", "B", "B", "C", "C", "B", "A"),
  utterance = c("hi [there] long time no [see] how're things", "[howdy]", 
                "[yeah]", "we're [good]", "[great]", "[really]", "yeah [fine]", "[and y]ourself?"),
  starttime_ms = c(10, 25, 444, 1133, 1400, 1567, 1800, 1974),
  endtime_ms = c(1100, 125, 555, 1566, 1566, 1700, 2000, 2111)
)


# Get min endtime by speaker
min_endtime_df <- 
  df %>% group_by(speaker) %>% 
    summarise(endtime_ms_l = min(endtime_ms)) %>% 
  # Shift the speakers to merge
  mutate(speaker = lead(speaker))

# Merge with the previous group
df %>% 
  merge(min_endtime_df,
        by = 'speaker',
        all = TRUE) %>%
  # First condition
  filter(endtime_ms_l > endtime_ms) %>% 
  # Second condition
  group_by(grp = rleid(speaker)) %>%
  filter(grepl("^\\[[^][]+\\]$", utterance))

【Discussion】:
