Efficiently removing `NAs` in repeated measures designs using `tidyverse`

【Question title】: Efficiently removing `NAs` in repeated measures designs using `tidyverse`
【Posted】: 2026-01-31 08:15:02
【Question】:

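This is less a question about how to do something than about how to do it efficiently. Specifically, I want to remove NAs in a repeated measures design so that every group ends up with complete observations.

In the bugs_long data frame below, the same participants took part in four conditions (condition) and reported their desire (desire) to kill bugs in each one. If I want to run a repeated measures analysis on this dataset, the long format usually causes trouble: after pairwise exclusion of NAs, the groups no longer have the same number of observations. So the final data frame should omit the following five subjects entirely.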

# setup
set.seed(123)
library(ipmisc)
library(tidyverse)

# looking at the NAs
dplyr::filter(bugs_long, is.na(desire)) 
#> # A tibble: 5 x 6
#>   subject gender region        education condition desire
#>     <int> <fct>  <fct>         <fct>     <chr>      <dbl>
#> 1       2 Female North America advance   LDHF          NA
#> 2      80 Female North America less      LDHF          NA
#> 3      42 Female North America high      HDLF          NA
#> 4      64 Female Europe        some      HDLF          NA
#> 5      10 Female Other         high      HDHF          NA

This is the roundabout way I am currently hacking it to get it to work:

# figuring out the number of levels in the grouping factor
x_n_levels <- nlevels(as.factor(bugs_long$condition))[[1]]

# removing observations that don't have all repeated values
df <-
  bugs_long %>%
  filter(!is.na(condition)) %>%
  group_by(condition) %>%
  mutate(id = dplyr::row_number()) %>%
  ungroup(.) %>%
  filter(!is.na(desire)) %>%
  group_by(id) %>%
  mutate(n = dplyr::n()) %>%
  ungroup(.) %>%
  filter(n == x_n_levels) %>%
  select(-n)

# did this work? yes
df %>%
  group_by(condition) %>%
  count()
#> # A tibble: 4 x 2
#> # Groups:   condition [4]
#>   condition     n
#>   <chr>     <int>
#> 1 HDHF         88
#> 2 HDLF         88
#> 3 LDHF         88
#> 4 LDLF         88

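But I would be surprised if the tidyverse (dplyr + tidyr) did not offer a more efficient way to achieve this, and I would appreciate it if anyone has a better way to refactor it.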

【Comments】:

Tags: r dplyr tidyverse tidyr data-cleaning

【Solution 1】:

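You are making this much more complicated than it needs to be. Once you have found the cases to exclude, it is a simple matter of dropping every row in the data that matches those subjects, which is exactly what an anti-join does.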

set.seed(123)
library(ipmisc)
library(dplyr)

exclude <- filter(bugs_long, is.na(desire))
full_cases <- bugs_long %>%
  anti_join(exclude, by = "subject")
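To see what the anti-join does without needing the ipmisc data, here is a minimal sketch on a made-up tibble. The `toy` data and the name `toy_complete` are assumptions for illustration only and are not part of the original question. Subject 2 has one missing response, so both of that subject's rows should disappear, not just the row holding the NA.

```r
library(dplyr)

# Hypothetical toy data: 3 subjects x 2 conditions; subject 2 has one NA
toy <- tibble(
  subject   = c(1, 1, 2, 2, 3, 3),
  condition = c("A", "B", "A", "B", "A", "B"),
  desire    = c(5, 6, NA, 4, 7, 8)
)

# Rows with a missing response identify the subjects to exclude
exclude_toy <- filter(toy, is.na(desire))

# anti_join() keeps only the rows of `toy` whose subject has no match in
# `exclude_toy`, so every row belonging to subject 2 is dropped
toy_complete <- anti_join(toy, exclude_toy, by = "subject")

# Each remaining condition now has the same number of rows
count(toy_complete, condition)
```

Subjects 1 and 3 survive, leaving two rows per condition.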

The filtering and the anti-join can also be done in a single step, similar to what you would write in SQL:

bugs_long %>%
  anti_join(filter(., is.na(desire)), by = "subject")

Either way, the case counts check out:

count(full_cases, condition)
#> # A tibble: 4 x 2
#>   condition     n
#>   <chr>     <int>
#> 1 HDHF         88
#> 2 HDLF         88
#> 3 LDHF         88
#> 4 LDLF         88
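As a possible alternative to the anti-join, the same result can be sketched with a grouped filter: group by subject and keep only the groups with no NA in the response. This is not from the original answer. The `toy` tibble and the name `toy_complete` below are made up for illustration, and the sketch assumes that missing responses appear as explicit NA rows (as they do in bugs_long) rather than as absent rows.

```r
library(dplyr)

# Hypothetical toy data: 3 subjects x 2 conditions; subject 2 has one NA
toy <- tibble(
  subject   = c(1, 1, 2, 2, 3, 3),
  condition = c("A", "B", "A", "B", "A", "B"),
  desire    = c(5, 6, NA, 4, 7, 8)
)

# Within each subject, anyNA(desire) is TRUE if any response is missing;
# the filter therefore keeps or drops a subject's rows as a whole
toy_complete <- toy %>%
  group_by(subject) %>%
  filter(!anyNA(desire)) %>%
  ungroup()

count(toy_complete, condition)
```

Applied to the real data, the equivalent call would presumably be `bugs_long %>% group_by(subject) %>% filter(!anyNA(desire)) %>% ungroup()`. Neither this nor the anti-join detects a subject whose row for a condition is absent altogether.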

【Comments】: