【Question title】: Comparing values of a certain row with a certain number of previous rows in data.table
【Posted】: 2020-07-16 02:24:37
【Question】:

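This is an extension of a question asked earlier.

In a database containing firm and category values, I want to compute the following: if a firm enters a category in which it has not been engaged during the previous three (3) years (not counting the current year), the entry is labeled "NEW"; otherwise it is labeled "OLD".

In the following dataset: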

library(data.table)

df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

The desired result is:

    year category Newness
 1: 1979        A     NEW
 2: 1979        A     NEW
 3: 1980        B     NEW
 4: 1980        C     NEW
 5: 1981        A     NEW
 6: 1981        D     NEW
 7: 1982        F     NEW
 8: 1983        F     OLD
 9: 1983        C     OLD
10: 1984        A     OLD
11: 1984        B     NEW

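Thank you very much.

【Comments】:

  • Thanks for letting me know. I have edited the data frame.
  • Shouldn't row 5 be OLD rather than NEW?
  • @chinsoon12 It would be under the old example. With the updated df, the firm has not been involved in category B during the previous three years (1981, 1982, 1983), so it is NEW. Your code works very well, though.

Tags: r group-by data.table rows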


【Solution 1】:

Here are a few options.

1) Using a non-equi self-join with mult:

df[, yrsago := year - 3L]
df[, Newness := 
    c("OLD", "NEW")[1L + df[df, on=.(category, year>=yrsago, year<year), mult="first", is.na(x.category)]]
]

2) Using a non-equi self-join with by=.EACHI:

df[, yrsago := year - 3L]
df[, Newness2 := 
    c("OLD", "NEW")[1L + df[df, on=.(category, year>=yrsago, year<year), by=.EACHI, .N==0L]$V1]
]

3) Using a rolling join, which should be the fastest:

df[, q := year - 0.1]
df[, Newness3 := 
    df[df, on=.(category, year=q), roll=3L, fifelse(is.na(x.year), "NEW", "OLD")]
]

Output:

    year category yrsago Newness Newness2      q Newness3
 1: 1979        A   1976     NEW      NEW 1978.9      NEW
 2: 1979        A   1976     NEW      NEW 1978.9      NEW
 3: 1980        B   1977     NEW      NEW 1979.9      NEW
 4: 1980        C   1977     NEW      NEW 1979.9      NEW
 5: 1981        A   1978     OLD      OLD 1980.9      OLD
 6: 1981        D   1978     NEW      NEW 1980.9      NEW
 7: 1982        F   1979     NEW      NEW 1981.9      NEW
 8: 1983        F   1980     OLD      OLD 1982.9      OLD
 9: 1983        C   1980     OLD      OLD 1982.9      OLD
10: 1984        A   1981     OLD      OLD 1983.9      OLD
11: 1984        B   1981     NEW      NEW 1983.9      NEW

Data:

df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
    category = c("A","A","B","C","A","D","F","F","C","A","B"))
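
A small sketch of why option 3 subtracts 0.1 before the rolling join may help here. This is my reading of the approach, not something the answer states: with roll=3L, each lookup value is matched to the most recent earlier year that is at most 3 away, so looking up year - 0.1 excludes the same year while still reaching back to year - 3. The tables `x` and `probe` below are hypothetical and exist only for illustration.

```r
library(data.table)

# Hypothetical history: category "A" was entered only once, in 1979.
x <- data.table(category = "A", year = 1979)

# Hypothetical lookups for several later (and equal) years.
probe <- data.table(category = "A", year = c(1979, 1980, 1982, 1983))

# Shift the lookup year slightly back so an entry in the same year cannot match.
probe[, q := year - 0.1]

# roll = 3L carries the last earlier observation forward by at most 3 units.
# x.year is NA when no earlier entry falls inside that window.
probe[, matched := x[probe, on = .(category, year = q), roll = 3L, x.year]]

print(probe)
# matched is NA for 1979 (same year only) and for 1983 (the gap exceeds 3),
# and 1979 for the lookups in 1980 and 1982.
```

The NA rows are the ones option 3 labels "NEW"; the matched rows are labeled "OLD".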

【Comments】:

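  • The rolling join is indeed the fastest. Is there any way I can run it by group (firm)?
  • Yes, just add the group column to the on argument.
  • Thanks. It's really fast!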
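
The suggestion in the comments above (adding the group to on) could look roughly like the following. This is a minimal sketch under assumptions: the `firm` column and the toy data in `dt` are hypothetical, since the question's example data has no firm identifier.

```r
library(data.table)

# Hypothetical data: the `firm` column and all values are made up for illustration.
dt <- data.table(
  firm     = c("X", "X", "X", "Y", "Y"),
  year     = c(1979, 1981, 1985, 1980, 1981),
  category = c("A", "A", "A", "A", "B")
)

# Shift the lookup year slightly back so the same year is excluded.
dt[, q := year - 0.1]

# Including `firm` in `on` restricts matches to earlier rows of the same firm.
dt[, Newness := dt[dt, on = .(firm, category, year = q), roll = 3L,
                   fifelse(is.na(x.year), "NEW", "OLD")]]

# Drop the helper column.
dt[, q := NULL]

print(dt)
# Newness: NEW, OLD, NEW, NEW, NEW
```

Firm Y's 1980 entry in category A is "NEW" even though firm X entered A in 1979, because the join no longer matches across firms. Firm X's 1985 entry is "NEW" because its previous A entry (1981) lies more than three years back.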
【Solution 2】:

Using mapply:

df$Newness <- c('NEW', 'OLD')[mapply(function(x, y) any(y == df$category
                [df$year < x & df$year >= (x - 3)]), df$year, df$category) + 1]
df

#    year category Newness
# 1: 1979        A     NEW
# 2: 1979        A     NEW
# 3: 1980        B     NEW
# 4: 1980        C     NEW
# 5: 1980        A     OLD
# 6: 1981        D     NEW
# 7: 1981        F     NEW
# 8: 1982        F     OLD
# 9: 1982        C     OLD
#10: 1982        A     OLD
#11: 1982        B     OLD

【Comments】:

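    【Solution 3】:

    This is not an answer; I am just posting timing benchmarks of the solutions provided, applied to part of a patent database I am working with: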

    > df[, yrsago := year - 3L]
    > df[, q := year - 0.1]
    > tbench <- bench::mark(time_unit="s",
    +                     sol_1 = df[, Newness := c('NEW', 'OLD')[mapply(function(x, y) any(y == df$category[df$year < x & df$year >= (x - 3)]), df$year, df$category) + 1]],
    +                    sol_2 = 
    +                      df[, Newness := c("OLD", "NEW")[1L + df[df, on=.(category, year>=yrsago, year<year), mult="first",
    +                                                              is.na(x.category)]]],
    +                    sol_3 = df[, Newness2 := c("OLD", "NEW")[1L + df[df, on=.(category, year>=yrsago, year<year),
    +                                                                     by=.EACHI, .N==0L]$V1]],
    +                    
    +                    sol_4 = 
    +                      df[, Newness3 := df[df, on=.(category, year=q), roll=3L, fifelse(is.na(x.year), "NEW", "OLD")]],
    +                    
    +                    min_time = 1
    + )
    > 
    > tbench
    # A tibble: 4 x 13
      expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory      time    gc     
      <bch:expr>   <dbl>   <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list>       <list>      <list>  <list> 
    1 sol_1      0.144   0.192        5.53     321MB     1.11     5     1      0.905 <data.table~ <Rprofmem[~ <bch:t~ <tibbl~
    2 sol_2      0.00611 0.00629    159.       406KB     1.09   146     1      0.921 <data.table~ <Rprofmem[~ <bch:t~ <tibbl~
    3 sol_3      0.00632 0.00647    154.       406KB     1.07   144     1      0.936 <data.table~ <Rprofmem[~ <bch:t~ <tibbl~
    4 sol_4      0.00405 0.00416    238.       393KB     0      238     0      1.00  <data.table~ <Rprofmem[~ <bch:t~ <tibbl~
    

    Thanks, everyone, for the help.

    【Comments】:
