【Question title】: Tricky conditional imputation, ideally using Tidyverse
【Posted】: 2021-07-17 18:34:13
【Question】:

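I have a problem where I need to do some tricky conditional imputation of missing values while flagging the imputed values, and I'm not sure how to approach it.

My data is in tidy (long) format. What I want to produce is a complete dataset in which every "state" has a full set of rows with "births" values for "Male", "Female" and "Total". If a state is missing "Total", it is imputed as "Male" + "Female" for that state. If we have "Total" but not "Male" or "Female", the missing "births" value is computed as "Total" - "Male" (or "Total" - "Female", depending on which one is missing).

However, a missing value can only be imputed if "source" is the same for all rows present for that state. We cannot impute by combining data from different sources. Finally, every imputed row should carry its parent state and source, and should have a "1" flag in the binary "aggregated" column.

A reprex is below, followed by an example of the desired result with a quick explanation. I'd like to do this with Tidyverse if possible, but I'm open to better solutions. Thanks in advance!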

sex <- c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total") 
state <- c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont", "Washington", "Washington", "Montana", "Montana")
source <- c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN")
aggregated <- c(0, 0, 0, 0, 0, 0, 0, 0, 0)
births <- c(20, 30, 50, 15, 16, 20, 27, 15, 33)

df <- data.frame(sex, state, source, aggregated, births)
df
     sex      state source aggregated births
1   Male New Jersey    WHO          0     20
2 Female New Jersey    WHO          0     30
3  Total New Jersey    WHO          0     50
4   Male    Vermont    CDC          0     15
5 Female    Vermont    CDC          0     16
6   Male Washington     UN          0     20
7 Female Washington    CDC          0     27
8   Male    Montana     UN          0     15
9  Total    Montana     UN          0     33

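Explanation of the resulting set

New Jersey: complete from the start, no changes

Vermont: Total is missing and all sources are the same (CDC), so a new Total row is created with births imputed as Male + Female

Washington: Total is missing, but Male and Female come from different sources, so nothing can be imputed

Montana: Female is missing and all sources are the same (UN), so a new Female row is created as Total births - Male.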

      sex      state source aggregated births
1    Male New Jersey    WHO          0     20
2  Female New Jersey    WHO          0     30
3   Total New Jersey    WHO          0     50
4    Male    Vermont    CDC          0     15
5  Female    Vermont    CDC          0     16
6   Total    Vermont    CDC          1     31
7    Male Washington     UN          0     20
8  Female Washington    CDC          0     27
9    Male    Montana     UN          0     15
10 Female    Montana     UN          1     18
11  Total    Montana     UN          0     33
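
The per-state reasoning in the explanation can be reproduced with a short summary before any imputation is attempted. This is only a sketch, assuming the `df` from the reprex (repeated here so the snippet runs on its own); the column names `n_sources`, `missing_sex`, `n_missing` and `can_impute` are made up for illustration.

```r
library(dplyr)

# Same data as the reprex, repeated so the snippet is self-contained
df <- data.frame(
  sex = c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total"),
  state = c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont",
            "Washington", "Washington", "Montana", "Montana"),
  source = c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN"),
  aggregated = 0,
  births = c(20, 30, 50, 15, 16, 20, 27, 15, 33)
)

all_sexes <- c("Male", "Female", "Total")

# One row per state: how many sources it has and which sex rows are absent
diagnosis <- df %>%
  group_by(state) %>%
  summarise(
    n_sources   = n_distinct(source),
    missing_sex = paste(setdiff(all_sexes, sex), collapse = ", "),
    n_missing   = length(setdiff(all_sexes, sex)),
    .groups = "drop"
  ) %>%
  # Imputation is allowed only with a single source and exactly one missing row
  mutate(can_impute = n_sources == 1 & n_missing == 1)

diagnosis
```

Under these rules only Vermont and Montana qualify: New Jersey has nothing missing, and Washington mixes two sources.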

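【Comments】:

    Tags: r tidyverse aggregation imputation


    【Solution 1】:

    Update 03: now I can finally rest!

    I know this is nothing compared with the two brilliant solutions proposed by dear @akrun, but I couldn't leave a solution here that doesn't produce the desired output. So I made some modifications, with the result below, and I also extended the code to cover the case where the Male value is missing from the births column.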

    library(dplyr)
    library(tidyr)
    library(purrr)  # provides map_dfr()
    
    df %>%
      # Widen, then lengthen again, so each state/source pair gets
      # Male, Female and Total rows (NA where a row was absent)
      pivot_wider(names_from = sex, values_from = births) %>%
      pivot_longer(Male:Total, names_to = "sex", values_to = "births") %>%
      group_split(state, source) %>% 
      # More than one missing value cannot be imputed: keep only observed rows
      map_dfr(~ if(sum(is.na(.x$births)) > 1 ) drop_na(.x) else .x) %>%
      group_by(state, source) %>%
      # Flag the single missing row, then fill it from the other two
      mutate(aggregated = ifelse(is.na(births), 1, 0),
             births = ifelse(sex == "Female" & is.na(births), births[sex == "Total"] - 
                               births[sex == "Male"], 
                             ifelse(sex == "Total" & is.na(births), 
                                    births[sex == "Female"] + births[sex == "Male"], 
                                    ifelse(sex == "Male" & is.na(births), 
                                           births[sex == "Total"] - births[sex == "Female"], 
                                           births)))) %>%
      relocate(state, source, sex)
    
    
    # A tibble: 11 x 5
    # Groups:   state, source [5]
       state      source sex    aggregated births
       <chr>      <chr>  <chr>       <dbl>  <dbl>
     1 Montana    UN     Male            0     15
     2 Montana    UN     Female          1     18
     3 Montana    UN     Total           0     33
     4 New Jersey WHO    Male            0     20
     5 New Jersey WHO    Female          0     30
     6 New Jersey WHO    Total           0     50
     7 Vermont    CDC    Male            0     15
     8 Vermont    CDC    Female          0     16
     9 Vermont    CDC    Total           1     31
    10 Washington CDC    Female          0     27
    11 Washington UN     Male            0     20
    
    

    Update

    Thanks to the brilliant solution from my dear teacher/friend @akrun, the problem with the aggregated column is solved:

    library(dplyr)
    library(tibble)
    library(purrr)  # provides map_dfr()
    
    df %>% 
      group_split(state, source) %>% 
      map_dfr(~ if (all(c('Male', 'Female') %in% .x$sex) && !'Total' %in% .x$sex) {
        # Missing Total: Male + Female
        add_row(.x, sex = 'Total', state = first(.x$state), source = first(.x$source), aggregated = 1, births = sum(.x$births))
      } else if (all(c('Male', 'Total') %in% .x$sex) && !'Female' %in% .x$sex) {
        # Caution: sum() adds Male and Total here, so Montana's Female row
        # comes out as 48 rather than the desired 18 (Total - Male)
        add_row(.x, sex = 'Female', state = first(.x$state), source = first(.x$source), aggregated = 1, births = sum(.x$births))
      } else .x)
    
    
    # A tibble: 11 x 5
       sex    state      source aggregated births
       <chr>  <chr>      <chr>       <dbl>  <dbl>
     1 Male   Montana    UN              0     15
     2 Total  Montana    UN              0     33
     3 Female Montana    UN              1     48
     4 Male   New Jersey WHO             0     20
     5 Female New Jersey WHO             0     30
     6 Total  New Jersey WHO             0     50
     7 Male   Vermont    CDC             0     15
     8 Female Vermont    CDC             0     16
     9 Total  Vermont    CDC             1     31
    10 Female Washington CDC             0     27
    11 Male   Washington UN              0     20
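
The Montana Female row in the output above is 48 because `sum(.x$births)` adds Male and Total. A sketch of the same `group_split()` + `add_row()` idea with the subtraction written out, and with a branch for a missing Male row, might look like the following. The helpers `impute_group()` and `value()` are hypothetical names introduced here, and the sketch assumes at most one row per sex within each state/source group.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same data as the reprex, repeated so the snippet is self-contained
df <- data.frame(
  sex = c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total"),
  state = c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont",
            "Washington", "Washington", "Montana", "Montana"),
  source = c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN"),
  aggregated = 0,
  births = c(20, 30, 50, 15, 16, 20, 27, 15, 33)
)

# Hypothetical helper: add the one missing row to a state/source group, if possible
impute_group <- function(g) {
  has   <- function(s) all(s %in% g$sex)
  value <- function(s) g$births[g$sex == s]  # assumes one row per sex
  new_row <- function(s, b) {
    add_row(g, sex = s, state = first(g$state), source = first(g$source),
            aggregated = 1, births = b)
  }
  if (has(c("Male", "Female")) && !has("Total")) {
    new_row("Total", value("Male") + value("Female"))
  } else if (has(c("Male", "Total")) && !has("Female")) {
    new_row("Female", value("Total") - value("Male"))
  } else if (has(c("Female", "Total")) && !has("Male")) {
    new_row("Male", value("Total") - value("Female"))
  } else {
    g  # complete already, or too much missing to impute
  }
}

result <- df %>%
  group_split(state, source) %>%
  map_dfr(impute_group)

result
```

Because the groups are split by both state and source, Washington's two single-row groups fall through to the final branch and are left unchanged.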
    
    

    Update 02

    Another very good solution from dear @akrun:

    
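    library(dplyr)
    library(tidyr)  # provides complete()
    
    df %>% 
      group_by(state, source) %>% 
      # Add NA rows for any sex missing within each state/source group
      complete(sex = unique(df$sex)) %>% 
      # Order rows Male, Female, Total so that Total is last in each group
      arrange(state, source, factor(sex, levels = c('Male', 'Female', 'Total'))) %>% 
      # Groups with more than one missing row cannot be imputed: drop their NA rows
      filter(sum(is.na(aggregated)) > 1 & !is.na(aggregated)|sum(is.na(aggregated)) <= 1) %>% 
      # Missing Total (last row) = sum of the others;
      # missing Male/Female = Total minus the observed value
      mutate(aggregated = replace(aggregated, is.na(aggregated), 1), 
             births = case_when(is.na(births) &  row_number() == n() ~ sum(births, na.rm = TRUE), 
                                is.na(births) ~ last(births) - na.omit(births)[1], TRUE ~ births))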
    
    # A tibble: 11 x 5
    # Groups:   state, source [5]
       state      source sex    aggregated births
       <chr>      <chr>  <chr>       <dbl>  <dbl>
     1 Montana    UN     Male            0     15
     2 Montana    UN     Female          1     18
     3 Montana    UN     Total           0     33
     4 New Jersey WHO    Male            0     20
     5 New Jersey WHO    Female          0     30
     6 New Jersey WHO    Total           0     50
     7 Vermont    CDC    Male            0     15
     8 Vermont    CDC    Female          0     16
     9 Vermont    CDC    Total           1     31
    10 Washington CDC    Female          0     27
    11 Washington UN     Male            0     20
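
To see what the `complete()` step contributes, it can help to run it on a single group. The sketch below uses only the Montana rows from the reprex; the object names `montana` and `completed` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Montana rows from the reprex: Female is absent
montana <- data.frame(
  sex = c("Male", "Total"),
  state = "Montana",
  source = "UN",
  aggregated = 0,
  births = c(15, 33)
)

# complete() adds a row for each listed sex that is not present in the group;
# the non-grouping columns of the added row (aggregated, births) are NA
completed <- montana %>%
  group_by(state, source) %>%
  complete(sex = c("Male", "Female", "Total")) %>%
  ungroup()

completed
```

The new Female row has `NA` in both `aggregated` and `births`, which is what the later `filter()` and `mutate()` steps key on.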
    
    

    【Comments】:

    • You can do df %>% group_split(state, source) %>% map_dfr(~ if(all(c('Male', 'Female') %in% .x$sex) && !'Total' %in% .x$sex) { add_row(.x, sex = 'Total', state = first(.x$state), source = first(.x$source), aggregated = 1, births = sum(.x$births))} else .x)
    • I realized that one more row needs to be added for Montana; you can specify an else if condition for that
    • Or you can use adorn_totals with a condition
    • Otherwise it will be NA, and you can later fill it with the previous non-NA value
    • That's fine. You can mention my name