Create household head based on age and member ID: answers

【Question title】: Create household head based on age and member id
【Posted】: 2021-07-27 03:15:56
【Question】:

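I have a data frame of household members with 3 integer columns: "hid", "sub" and "age". I want to create a new logical variable in the data frame, called "hh", that marks the household head, defined as follows:

If the household has only 1 member, the value is TRUE.
If the household has 2 or more members, the head is the person aged 18 to 65 (inclusive) who has the smallest subject ID ("sub") among the members aged 18 to 65.
If the household has no member aged 18 to 65, the head is the person with the smallest subject ID.

Every household must have 1 and only 1 head.

My data look like this: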

# A tibble: 10 x 3
     hid   sub   age
   <dbl> <dbl> <dbl>
 1     1     1    75
 2     1     2    55
 3     2     1    35
 4     3     1    69
 5     3     2    72
 6     4     1    69
 7     5     1    15
 8     5     2    17
 9     5     3    42
10     6     1    72

I would like the result to look like this:

> result
# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE  # Not 18-65 & there is another aged 18-65 within this household.
 2     1     2    55 TRUE   # Aged 18-65 and the smallest sub id within this household.
 3     2     1    35 TRUE   # Only 1 in this household.
 4     3     1    69 TRUE   # Not aged 18-65, but no other member is and smallest sub id.
 5     3     2    72 FALSE  # Not aged 18-65, and not the smallest sub id.
 6     4     1    69 TRUE   # Only 1 in this household.
 7     5     1    15 FALSE  # Not aged 18-65 and others in this household qualify.
 8     5     2    17 FALSE  # Not aged 18-65 and others in this household qualify.
 9     5     3    42 TRUE   # Aged 18-65 and the smallest sub id among those aged 18-65 within this household.
10     5     4    62 FALSE  # Aged 18-65 but not the smallest sub id among those aged 18-65 within this household.

Thanks!

d <- structure(list(hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5), 
                      sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
                      age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)), 
                 row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

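【Question discussion】:

What have you tried that didn't work? For logic that can get this complicated, it helps to sketch a representation of the code by hand (e.g. a decision tree) to break it apart. It also helps to create temporary variables that track the individual conditions, such as the number of people in the household or whether a member is aged 18-65, rather than trying to fold all the logic into a single step.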
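As a rough illustration of the temporary-variable approach suggested in the comment above, the sketch below builds the rule in small steps. The helper columns `in_range`, `any_in_range` and `candidate` are names made up for this example, and `d` is rebuilt from the `dput` output in the question.

```r
library(dplyr)

# Data rebuilt from the dput() output in the question
d <- tibble(
  hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
  sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
  age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)
)

steps <- d %>%
  group_by(hid) %>%
  mutate(
    in_range     = between(age, 18, 65),     # is this member aged 18-65?
    any_in_range = any(in_range),            # does the household have such a member?
    # candidates: the 18-65 members if there are any, otherwise every member
    candidate    = in_range | !any_in_range,
    # head: the candidate with the smallest sub within the household
    hh           = candidate & sub == min(sub[candidate])
  ) %>%
  ungroup()

steps
```

A single-member household needs no special case here: its only member is always a candidate and always has the smallest `sub`. This assumes `sub` is unique within each household.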

Tags: r dplyr data-manipulation data-management

【Solution 1】:

You can arrange the data so that the first row of each group is the one whose hh value should be TRUE.

library(dplyr)

d %>%
  arrange(hid, !between(age, 18, 65), sub) %>%
  mutate(hh = !duplicated(hid)) 

#     hid   sub   age hh   
#   <dbl> <dbl> <dbl> <lgl>
# 1     1     2    55 TRUE 
# 2     1     1    75 FALSE
# 3     2     1    35 TRUE 
# 4     3     1    69 TRUE 
# 5     3     2    72 FALSE
# 6     4     1    69 TRUE 
# 7     5     3    42 TRUE 
# 8     5     4    62 FALSE
# 9     5     1    15 FALSE
#10     5     2    17 FALSE

Sorting on !between(age, 18, 65) places individuals aged 18-65 ahead of those outside that range, because FALSE sorts before TRUE.
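The arrange-based approach reorders the rows within each household. If the original row order matters, one possible variation is to record the row position first and sort back afterwards. This is only a sketch: the helper column `rn` is a name chosen for the example, and `d` is rebuilt from the `dput` output in the question.

```r
library(dplyr)

# Data rebuilt from the dput() output in the question
d <- tibble(
  hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
  sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
  age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)
)

res <- d %>%
  mutate(rn = row_number()) %>%                    # remember the original position
  arrange(hid, !between(age, 18, 65), sub) %>%     # 18-65 first, then smallest sub
  mutate(hh = !duplicated(hid)) %>%                # first row of each household is the head
  arrange(rn) %>%                                  # restore the original order
  select(-rn)

res
```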

【Discussion】:

【Solution 2】:

Here is one option

library(dplyr)
d %>% 
    group_by(hid) %>%
     mutate(hh = if(n() == 1) TRUE else if(n() > 1 & 
         !any(between(age, 18, 65))) age == min(age) else
        age == min(age[between(age, 18, 65)])) %>%
    ungroup

-output

# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE
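Note that the snippet above selects the head by minimum age, whereas the question asks for the minimum "sub"; the two happen to coincide on the sample data. A possible sub-based variant is sketched below. To make the difference visible, it uses a hypothetical variation of the data, `d2`, in which the last two members of household 5 are aged 62 and 42.

```r
library(dplyr)

# Hypothetical variation of the question's data: household 5 aged 15, 17, 62, 42
d2 <- tibble(
  hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
  sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
  age = c(75, 55, 35, 69, 72, 69, 15, 17, 62, 42)
)

by_sub <- d2 %>%
  group_by(hid) %>%
  mutate(hh = if (any(between(age, 18, 65))) {
    # at least one member aged 18-65: smallest sub among them
    sub == min(sub[between(age, 18, 65)])
  } else {
    # nobody aged 18-65: smallest sub overall
    sub == min(sub)
  }) %>%
  ungroup()

by_sub
```

With this data the head of household 5 is the member with sub 3 (aged 62), whereas an age-based comparison would pick the member aged 42.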

Or another, simplified option is

d %>% 
    mutate(rn = row_number()) %>%
    arrange(hid, sub, age) %>%
    group_by(hid) %>% 
    mutate(hh = age == coalesce(age[between(age, 18, 65)][1], 
           first(age))) %>% 
    ungroup %>%
    arrange(rn) %>%
    select(-rn)

-output

# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE
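One thing worth checking with this option is that `hh` is derived by comparing ages, so two members of the same household who share the selected age would both be flagged. The sketch below uses a made-up household, `twins`, to show this, alongside a possible variant that compares `sub` instead (assuming `sub` is unique within a household).

```r
library(dplyr)

# Hypothetical household in which two members aged 18-65 share the same age
twins <- tibble(hid = c(1, 1, 1), sub = c(1, 2, 3), age = c(70, 40, 40))

# Comparison on age, as in the snippet above
by_age <- twins %>%
  arrange(hid, sub) %>%
  group_by(hid) %>%
  mutate(hh = age == coalesce(age[between(age, 18, 65)][1], first(age))) %>%
  ungroup()

# Variant comparing sub: first sub among the 18-65 members, else the first sub overall
by_sub <- twins %>%
  arrange(hid, sub) %>%
  group_by(hid) %>%
  mutate(hh = sub == coalesce(sub[between(age, 18, 65)][1], first(sub))) %>%
  ungroup()

sum(by_age$hh)  # number of heads flagged by the age comparison
sum(by_sub$hh)  # number of heads flagged by the sub comparison
```

Counting the heads per household in this way is a quick check of the "1 and only 1 head" requirement.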

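【Discussion】:

Where is the "sub" variable?
Hi @akrun. Suppose the members of the last household (#5) were aged 15, 17, 62 and 42. I would want the third member to be the household head, based on the smallest value of the "sub" variable rather than the smallest age. Thanks. :)
@Edward Can you update your post with a new example? The second solution sorts by sub id
Yes, the second solution works fine. The first one, not so much. :P Thanks!!
@Edward Thanks, I missed the sub part. The input data shown in your post is slightly different from the dput

【Solution 3】:

An option with case_when, where each case translates one of your conditions 1 to 3 into code: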

library(dplyr)

d %>% 
    group_by(hid) %>% 
    mutate(hh = case_when(max(sub) == 1 ~ TRUE,
                          max(sub) > 1 & 
                              between(age, 18, 65) &
                              sub == min(sub[between(age, 18, 65)]) ~ TRUE,
                          max(between(age, 18, 65)) < 1 & 
                              sub == min(sub[max(between(age, 18, 65)) < 1]) ~ TRUE,
                          TRUE ~ FALSE))

Output:

     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE

【Discussion】: