【Question title】: Add a unique ID by name and date in R [duplicate]
【Posted】: 2019-11-01 09:06:31
【Question description】:

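I am doing some data cleaning/formatting, and I want to add a unique identifier to each record by name and date. For example, "Bob" might have four check-in dates, two of which are the same. In that case I want to give him three distinct (consecutive) ID numbers.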

Here is the closest I have come to what I want:


The example dataset I created:


library(tidyverse)

# data_frame() is deprecated; tibble() is its replacement
tst <- tibble(
  name = c("Bob", "Sam", "Roger", "Stacy", "Roger", "Roger", "Sam", "Bob", "Sam", "Stacy", "Bob", "Stacy", "Roger", "Bob"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2009-07-03", "2016-04-01", "2002-01-03", "2019-02-10", "2005-04-17", "2009-07-03", "2010-09-21", "2012-11-12", "2015-12-31", "2014-10-10", "2015-06-02", "2003-08-21")),
  amount = round(runif(14, 0, 100), 2)
)

Generating a check_in_number variable...

tst2 <- tst %>%
  arrange(date) %>%
  group_by(name, date) %>%
  mutate(check_in_number = row_number())

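The code above gives Bob check_in_number values of 1, 1, 2, 1, because row_number() restarts inside every (name, date) group. I want the output to be 1, 2, 2, 3. In other words, I want check-ins on the same date to be treated as a single check-in.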

Can the tidyverse do this? Am I overlooking a simple approach?


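There is a similar question, linked below, but I set it aside because my problem involves an ordered date variable that I arrange the data by. In other words, my data requires the new variable to be sequential.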

How to number/label data-table by group-number from group_by?

【Comments】:

    Tags: r dataframe tidyverse


    【Solution 1】:

    You need group_indices:

    library(tidyverse)
    
    tst <- tibble(
      name = c("Bob", "Sam", "Roger", "Stacy", "Roger", "Roger", "Sam", "Bob", "Sam", "Stacy", "Bob", "Stacy", "Roger", "Bob"),
      date = as.Date(c("2009-07-03", "2010-08-12", "2009-07-03", "2016-04-01", "2002-01-03", "2019-02-10", "2005-04-17", "2009-07-03", "2010-09-21", "2012-11-12", "2015-12-31", "2014-10-10", "2015-06-02", "2003-08-21")),
      amount = round(runif(14, 0, 100), 2)
    )
    
    tst %>%
      arrange(name, date) %>%
      mutate(check_in_number = group_indices(., name, date))
    #> # A tibble: 14 x 4
    #>    name  date       amount check_in_number
    #>    <chr> <date>      <dbl>           <int>
    #>  1 Bob   2003-08-21  91.1                1
    #>  2 Bob   2009-07-03  38.1                2
    #>  3 Bob   2009-07-03  28.3                2
    #>  4 Bob   2015-12-31  22.3                3
    #>  5 Roger 2002-01-03  68.3                4
    #>  6 Roger 2009-07-03  83.8                5
    #>  7 Roger 2015-06-02  94.2                6
    #>  8 Roger 2019-02-10  48.8                7
    #>  9 Sam   2005-04-17  16.6                8
    #> 10 Sam   2010-08-12  93.2                9
    #> 11 Sam   2010-09-21  65.5               10
    #> 12 Stacy 2012-11-12  92.6               11
    #> 13 Stacy 2014-10-10  84.4               12
    #> 14 Stacy 2016-04-01   7.43              13
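
    Passing grouping columns directly to group_indices() was deprecated in dplyr 1.0.0. The sketch below shows what I believe is the equivalent on newer versions: group first, then call cur_group_id(). It assumes dplyr 1.0.0 or later, and `visits` is a small hypothetical dataset used here so the result is deterministic; it is not the poster's `tst`.

```r
library(dplyr)

# Hypothetical, deterministic stand-in for the poster's data
visits <- tibble(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17"))
)

numbered <- visits %>%
  arrange(name, date) %>%
  group_by(name, date) %>%                    # one group per distinct (name, date)
  mutate(check_in_number = cur_group_id()) %>% # same ID for rows in the same group
  ungroup()

numbered
```

    Rows that share a name and a date get the same ID, and the IDs run across all names rather than restarting for each one.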
    

    If you need the numbering to restart for each name, you can rescale by the first value within each name:

    tst %>%
      arrange(name, date) %>%
      mutate(check_in_number = group_indices(., name, date)) %>%
      group_by(name) %>%
      mutate(check_in_number = check_in_number - first(check_in_number) + 1)
    #> # A tibble: 14 x 4
    #> # Groups:   name [4]
    #>    name  date       amount check_in_number
    #>    <chr> <date>      <dbl>           <dbl>
    #>  1 Bob   2003-08-21  91.1                1
    #>  2 Bob   2009-07-03  38.1                2
    #>  3 Bob   2009-07-03  28.3                2
    #>  4 Bob   2015-12-31  22.3                3
    #>  5 Roger 2002-01-03  68.3                1
    #>  6 Roger 2009-07-03  83.8                2
    #>  7 Roger 2015-06-02  94.2                3
    #>  8 Roger 2019-02-10  48.8                4
    #>  9 Sam   2005-04-17  16.6                1
    #> 10 Sam   2010-08-12  93.2                2
    #> 11 Sam   2010-09-21  65.5                3
    #> 12 Stacy 2012-11-12  92.6                1
    #> 13 Stacy 2014-10-10  84.4                2
    #> 14 Stacy 2016-04-01   7.43               3
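
    As an alternative to rescaling a global group index, the per-name numbering can probably be computed directly with dense_rank(), which gives tied dates the same rank and leaves no gaps. This is a minimal sketch, not the answerer's method; `visits` is a hypothetical dataset chosen so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical, deterministic stand-in for the poster's data
visits <- tibble(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam", "Bob"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17", "2015-12-31"))
)

per_name <- visits %>%
  arrange(name, date) %>%
  group_by(name) %>%
  # dense_rank() ranks the dates within each name; equal dates share a rank
  mutate(check_in_number = dense_rank(date)) %>%
  ungroup()

per_name
```

    Because the rank is computed from the dates themselves, this version does not depend on the rows being sorted first; arrange() is only there to make the output easier to read.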
    

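    Created on 2019-06-18 by the reprex package (v0.3.0)

    【Comments】:

    • Cool. I didn't know about group_indices.
    • Is there a way to make each name start at 1? So Bob would get 1, 2, 2, 3 and Roger would get 1, 2, 3, 4 in the same column?
    • You can rescale after computing the check-in number; try %>% group_by(name) %>% mutate(check_in_number = check_in_number - first(check_in_number) + 1)
    【Solution 2】:

    An option with data.table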

    library(data.table)
    setDT(tst)[order(name, date)][, check_in_number := .GRP, .(name, date)][]
    #      name       date amount check_in_number
    # 1:   Bob 2003-08-21  66.36               1
    # 2:   Bob 2009-07-03  22.18               2
    # 3:   Bob 2009-07-03  96.15               2
    # 4:   Bob 2015-12-31  31.64               3
    # 5: Roger 2002-01-03  92.32               4
    # 6: Roger 2009-07-03  41.85               5
    # 7: Roger 2015-06-02  15.46               6
    # 8: Roger 2019-02-10  80.38               7
    # 9:   Sam 2005-04-17  49.18               8
    #10:   Sam 2010-08-12  73.57               9
    #11:   Sam 2010-09-21  49.37              10
    #12: Stacy 2012-11-12  24.82              11
    #13: Stacy 2014-10-10  23.31              12
    #14: Stacy 2016-04-01  80.12              13
    

    If we need the numbering to restart for each name

    setDT(tst)[order(name, date)][, check_in_number := .GRP, 
       .(name, date)][,  check_in_number := match(check_in_number, 
              unique(check_in_number)), .(name)][]
    #      name       date amount check_in_number
    # 1:   Bob 2003-08-21  66.36               1
    # 2:   Bob 2009-07-03  22.18               2
    # 3:   Bob 2009-07-03  96.15               2
    # 4:   Bob 2015-12-31  31.64               3
    # 5: Roger 2002-01-03  92.32               1
    # 6: Roger 2009-07-03  41.85               2
    # 7: Roger 2015-06-02  15.46               3
    # 8: Roger 2019-02-10  80.38               4
    # 9:   Sam 2005-04-17  49.18               1
    #10:   Sam 2010-08-12  73.57               2
    #11:   Sam 2010-09-21  49.37               3
    #12: Stacy 2012-11-12  24.82               1
    #13: Stacy 2014-10-10  23.31               2
    #14: Stacy 2016-04-01  80.12               3
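
    The per-name numbering above can likely be written in a single grouped step with data.table's frank() and ties.method = "dense". This is a sketch under that assumption rather than the answerer's code, and `visits` is a hypothetical dataset used so the result is deterministic.

```r
library(data.table)

# Hypothetical, deterministic stand-in for the poster's data
visits <- data.table(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam", "Bob"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17", "2015-12-31"))
)

setorder(visits, name, date)  # sort in place by name, then date

# Dense rank of the dates within each name: equal dates share a number, no gaps
visits[, check_in_number := frank(date, ties.method = "dense"), by = name]

visits[]
```

    Unlike the chained `[order(name, date)]` form, which works on a sorted copy, setorder() and `:=` both modify `visits` in place, so the new column stays on the original object.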
    

    Data

    tst <- tibble::tibble(  # data_frame() is deprecated; tibble() replaces it
      name = c("Bob", "Sam", "Roger", "Stacy", "Roger", "Roger", "Sam", "Bob", "Sam", "Stacy", "Bob", "Stacy", "Roger", "Bob"),
      date = as.Date(c("2009-07-03", "2010-08-12", "2009-07-03", "2016-04-01", "2002-01-03", "2019-02-10", "2005-04-17", "2009-07-03", "2010-09-21", "2012-11-12", "2015-12-31", "2014-10-10", "2015-06-02", 
        "2003-08-21")),
      amount = round(runif(14, 0, 100), 2)
    )
    
