【Question title】: Non-equi join of data table operation
【Posted】: 2026-01-24 17:05:01
【Question description】:

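I want to add columns to data table 1 that are operations on data table 2, joined by a variable and restricted to rows where the date in data table 2 is on or before the date in data table 1.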

Data table 1 - I have a dataset of proposals, their owners, and their edit dates:

proposal_df <- structure(list(proposal = c(41, 62, 169, 72), owner = c("Adam", 
"Adam", "Alan", "Alan"), totalAtEdit = c(-27, 1000, 151, 1137
), editDate = structure(c(1556014200, 1560762240, 1563966600, 
1540832280), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA, 
-4L))

  proposal owner totalAtEdit            editDate
1       41  Adam         -27 2019-04-23 10:10:00
2       62  Adam        1000 2019-06-17 09:04:00
3      169  Alan         151 2019-07-24 11:10:00
4       72  Alan        1137 2018-10-29 16:58:00

Data table 2 - I have a log of proposals and the dates they were won or lost (outcome == 1 or 0):

proposal_log <- structure(list(proposal = c(9, 48, 43, 39, 45, 73, 111, 179, 
115, 146), outcome = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0), owner = c("Adam", 
"Adam", "Adam", "Adam", "Adam", "Alan", "Alan", "Alan", "Alan", 
"Alan"), totalAtEdit = c(2, 2, 4, 566, 100, 1264, 5000, 75, 493, 
18), editDate = structure(c(1557487860, 1561368780, 1561393140, 
1546446240, 1549463520, 1546614180, 1547196960, 1579603560, 1566925200, 
1536751800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA, 
-10L))

   proposal outcome owner totalAtEdit            editDate
1         9       0  Adam           2 2019-05-10 11:31:00
2        48       1  Adam           2 2019-06-24 09:33:00
3        43       1  Adam           4 2019-06-24 16:19:00
4        39       1  Adam         566 2019-01-02 16:24:00
5        45       0  Adam         100 2019-02-06 14:32:00
6        73       0  Alan        1264 2019-01-04 15:03:00
7       111       0  Alan        5000 2019-01-11 08:56:00
8       179       0  Alan          75 2020-01-21 10:46:00
9       115       0  Alan         493 2019-08-27 17:00:00
10      146       0  Alan          18 2018-09-12 11:30:00

I want to add several columns to proposal_df that are operations on proposal_log, joined by owner and restricted to rows where proposal_log$editDate <= proposal_df$editDate:

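  • countWon - number of proposals with outcome == 1
  • countLost - number of proposals with outcome == 0
  • wonValueMean - mean totalAtEdit of the proposals with outcome == 1
  • pctWon - share of proposals with outcome == 1 (as a fraction of all matched proposals)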

The output would look like this:

  proposal owner totalAtEdit            editDate countWon countLost wonValueMean    pctWon
1       41  Adam         -27 2019-04-23 10:10:00        1         1          566 0.5000000
2       62  Adam        1000 2019-06-17 09:04:00        1         2          566 0.3333333
3      169  Alan         151 2019-07-24 11:10:00        0         3          NaN 0.0000000
4       72  Alan        1137 2018-10-29 16:58:00        0         1          NaN 0.0000000

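Thanks!

【Question discussion】:

  • What do the outcome and manager variables in your desired output mean?
  • Oops, I forgot those were in there. The outcome and manager variables are not relevant to the question, so I removed them.

Tags: r join data.table non-equi-join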


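【Solution 1】:

Another option is to use by=.EACHI, which evaluates the j expression once for each row of the table passed as i: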

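library(data.table)
setDT(proposal_df)   # convert both tables to data.tables by reference
setDT(proposal_log)
# For each row of proposal_df (.SD), aggregate the proposal_log rows of the same
# owner whose editDate is on or before that row's editDate.
proposal_df[, c("countWon","countLost","wonValueMean","pctWon") := 
    proposal_log[.SD, on=.(owner, editDate<=editDate), by=.EACHI, {
        cw <- sum(outcome==1L)
        .(cw, sum(outcome==0L), mean(x.totalAtEdit[outcome==1L]), cw/.N)
    }][, (1L:2L) := NULL]  # drop the join columns (owner, editDate)
]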
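
To see what by=.EACHI does in isolation, here is a minimal sketch on made-up data. The tables log_dt and query_dt and their columns are hypothetical and are not part of the question; the only assumption is that data.table is installed.

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
log_dt <- data.table(owner = c("A", "A", "A", "B"),
                     day   = c(1L, 3L, 5L, 2L),
                     won   = c(1L, 0L, 1L, 0L))
query_dt <- data.table(owner = c("A", "A", "B"),
                       day   = c(2L, 4L, 9L))

# For each row of query_dt, aggregate the log_dt rows of the same owner
# whose day is on or before the query day.
res <- log_dt[query_dt, on = .(owner, day <= day), by = .EACHI,
              .(n = .N, nWon = sum(won == 1L))]
print(res)
```

The result has one row per row of query_dt. Its first two columns are the join columns (owner and day), which is why the answer above drops columns 1 and 2 before assigning the remaining ones.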

【Discussion】:

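    【Solution 2】:

    There may be a more elegant solution, but this gives the desired output in 4 steps.

    First, convert the tables to data.tables so that a non-equi join can be performed.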

    library(data.table)
    
    setDT(proposal_df)
    setDT(proposal_log)
    

    Step 1: Non-equi join on the same owner and proposal_log$editDate <= proposal_df$editDate.

    Proposals <- proposal_log[proposal_df, on = .(owner, editDate <= editDate)]
    

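    This returns the proposals in proposal_log that satisfy the condition. The proposal and totalAtEdit variables from the smaller table (proposal_df) are added to the result with the prefix i. The editDate column of the result holds the values from proposal_df, not those from proposal_log.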

       proposal outcome owner totalAtEdit            editDate i.proposal i.totalAtEdit
    1:       39       1  Adam         566 2019-04-23 10:10:00         41           -27
    2:       45       0  Adam         100 2019-04-23 10:10:00         41           -27
    3:        9       0  Adam           2 2019-06-17 09:04:00         62          1000
    4:       39       1  Adam         566 2019-06-17 09:04:00         62          1000
    5:       45       0  Adam         100 2019-06-17 09:04:00         62          1000
    6:       73       0  Alan        1264 2019-07-24 11:10:00        169           151
    7:      111       0  Alan        5000 2019-07-24 11:10:00        169           151
    8:      146       0  Alan          18 2019-07-24 11:10:00        169           151
    9:      146       0  Alan          18 2018-10-29 16:58:00         72          1137
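
In the joined result the editDate column carries the dates from proposal_df, so the original log dates are no longer visible. If both dates are needed, one option is to select them explicitly in j with the x. and i. prefixes. The sketch below uses hypothetical toy tables (log_dt, query_dt) rather than the question's data:

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
log_dt   <- data.table(owner = c("A", "A"), day = c(1L, 3L), val = c(10, 20))
query_dt <- data.table(owner = "A", day = 2L)

# x.day is the date from log_dt, i.day the date from query_dt
both <- log_dt[query_dt, on = .(owner, day <= day),
               .(owner, logDay = x.day, queryDay = i.day, val)]
print(both)
```

Only the log row with day 1 satisfies day <= 2, so the result has a single row with both dates side by side.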
    

    Step 2: Reshape this to wide format to count (fun=length) the number of outcomes for each i.proposal, then compute the proportion of won outcomes (outcome == 1).

    Outcomes <- dcast(Proposals, i.proposal ~ outcome, fun=length)[
      , pctWon := `1`/(`0`+`1`)]
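
The dcast approach creates the `0` and `1` columns only for outcome values that actually occur in the data, so the pctWon expression would fail if one of them never appeared. As an alternative sketch (not the answer's method), the same counts can be computed with a grouped aggregation. The table joined below is a hypothetical stand-in for part of the Step 1 result:

```r
library(data.table)

# Hypothetical stand-in for the joined table from Step 1
joined <- data.table(i.proposal = c(41, 41, 62, 62, 62, 72),
                     outcome    = c(1, 0, 0, 1, 0, 0))

# One row per proposal; the columns exist even when a count is zero
counts <- joined[, .(countWon  = sum(outcome == 1),
                     countLost = sum(outcome == 0),
                     pctWon    = mean(outcome == 1)),
                 by = i.proposal]
print(counts)
```

This also yields the column names requested in the question directly, so no renaming is needed afterwards.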
    

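    Step 3: Compute the mean totalAtEdit of the won outcomes (outcome==1) for each proposal, and add it to Outcomes with an update join on the proposal ID. Proposals with no won outcomes get NA.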

    Means <- Proposals[outcome==1, .(m_total = mean(totalAtEdit)), by=i.proposal]
    Outcomes[Means, on=.(i.proposal), wonValueMean := m_total]
    

    Step 4: Join this with the proposal_df table.

    proposal_df[Outcomes, on=c(proposal = "i.proposal")]
    

       proposal owner totalAtEdit            editDate 0 1    pctWon wonValueMean
    1:       41  Adam         -27 2019-04-23 10:10:00 1 1 0.5000000          566
    2:       62  Adam        1000 2019-06-17 09:04:00 2 1 0.3333333          566
    3:       72  Alan        1137 2018-10-29 16:58:00 1 0 0.0000000           NA
    4:      169  Alan         151 2019-07-24 11:10:00 3 0 0.0000000           NA
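
The result above names the count columns `0` and `1` and orders them differently from the layout requested in the question. A small follow-up sketch, assuming a table shaped like the printed result (final is a hypothetical stand-in, with the owner, totalAtEdit and editDate columns omitted for brevity), renames and reorders the columns by reference:

```r
library(data.table)

# Stand-in for the Step 4 result, values copied from the printed output
final <- data.table(proposal = c(41, 62, 72, 169),
                    `0` = c(1L, 2L, 1L, 3L),
                    `1` = c(1L, 1L, 0L, 0L),
                    pctWon = c(0.5, 1/3, 0, 0),
                    wonValueMean = c(566, 566, NA, NA))

# Rename the dcast count columns, then put them in the requested order
setnames(final, c("0", "1"), c("countLost", "countWon"))
setcolorder(final, c("proposal", "countWon", "countLost",
                     "wonValueMean", "pctWon"))
print(final)
```

Both setnames and setcolorder modify the table in place, so no reassignment is needed.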
    

    【Discussion】: