【Question title】: How can I create a column that cumulatively adds the sum of two previous rows based on conditions?
【Posted】: 2017-12-06 15:44:31
【Question】:

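I tried asking this before, but the question was poorly worded. This is a new attempt, since I still haven't solved it.

I have a dataset with winner, loser, date, winner points, and loser points.

For each row, I want two new columns, one for the winner and one for the loser, showing how many points that player has earned so far (as either winner or loser).

Sample data: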

winner <- c(1,2,3,1,2,3,1,2,3)
loser <-  c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)

The output I want is:

winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)

So far, I have solved it with a for loop, for example:

library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0

for(i in row.names(test_data)) {
  test_data[i,]$winner_points_sum_loop <-
    (
    test_data %>%
      dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
      dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
  +
    test_data %>%
      dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
      dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
    )
}

test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)

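Any suggestions for solving this? As the number of rows grows, the query takes quite a long time. I have tried working with the ave function, which lets me sum a player's points as a winner in one column, but I can't work out how to add their points as a loser.

【Discussion】:

  • I don't understand what winner_points_sum is supposed to be. Is it the sum of all winner_points in the rows above it? Can you clarify?
  • I'm totally confused. What do the winner and loser points mean? Why winner 1 and loser 3? How did you arrive at winner_points and loser_points? What does the loop do? Could you clarify?
  • winner_points_sum should be the sum of the points the winner scored in all of their previous matches, whether as winner or loser. winner and loser are IDs; they are just examples, and so are the points. @MattW @D Pinto

Tags: r for-loop cumulative-sum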


【Solution 1】:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <-  c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)


library(dplyr)
library(tidyr)

test_data %>%
  unite(winner, winner, winner_points) %>%                    # unite winner columns
  unite(loser, loser, loser_points) %>%                       # unite loser columns
  gather(type, pl_pts, winner, loser, -date) %>%              # reshape
  separate(pl_pts, c("player","points"), convert = T) %>%     # separate columns
  arrange(date) %>%                                           # order dates (in case it's not)
  group_by(player) %>%                                        # for each player
  mutate(sum_points = cumsum(points) - points) %>%            # get points up to that date
  ungroup() %>%                                               # forget the grouping
  unite(pl_pts_sumpts, player, points, sum_points) %>%        # unite columns
  spread(type, pl_pts_sumpts) %>%                             # reshape
  separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>%                # separate columns and give appropriate names
  separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
  select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum)   # select the order you prefer


# # A tibble: 9 x 7
#   winner loser       date winner_points loser_points winner_points_sum loser_points_sum
# *  <int> <int>     <date>         <int>        <int>             <int>            <int>
# 1      1     3 2017-10-01             2            1                 0                0
# 2      2     1 2017-10-02             1            0                 0                2
# 3      3     1 2017-10-03             2            1                 1                2
# 4      1     2 2017-10-04             1            0                 3                1
# 5      2     1 2017-10-05             2            1                 1                4
# 6      3     1 2017-10-06             1            0                 3                5
# 7      1     3 2017-10-07             2            1                 5                4
# 8      2     1 2017-10-08             1            0                 3                7
# 9      3     2 2017-10-09             2            1                 5                4
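
The same long-format idea can also be sketched in base R with ave, which the question mentions trying. This is a minimal sketch, not the answer author's method: the helper columns row, role, player and points_sum are names made up for the illustration, and it assumes each player appears at most once per date.

```r
# Sample data from the question
test_data <- data.frame(
  winner = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  loser  = c(3, 1, 1, 2, 1, 1, 3, 1, 2),
  date   = as.Date("2017-10-01") + 0:8,
  winner_points = c(2, 1, 2, 1, 2, 1, 2, 1, 2),
  loser_points  = c(1, 0, 1, 0, 1, 0, 1, 0, 1)
)

n <- nrow(test_data)

# Stack winners and losers into one long table, remembering the source row
long <- data.frame(
  row    = rep(seq_len(n), times = 2),
  role   = rep(c("winner", "loser"), each = n),
  player = c(test_data$winner, test_data$loser),
  date   = c(test_data$date, test_data$date),
  points = c(test_data$winner_points, test_data$loser_points)
)

# Order by date, then take each player's running total minus the current row,
# i.e. the points earned strictly before that game
long <- long[order(long$date), ]
long$points_sum <- ave(long$points, long$player,
                       FUN = function(p) cumsum(p) - p)

# Write the prior totals back to the original rows
is_winner <- long$role == "winner"
test_data$winner_points_sum <- NA_real_
test_data$loser_points_sum  <- NA_real_
test_data$winner_points_sum[long$row[is_winner]] <- long$points_sum[is_winner]
test_data$loser_points_sum[long$row[!is_winner]] <- long$points_sum[!is_winner]

test_data
```

On the sample data this yields the same two columns as the tibble above. Because ave is vectorised per group, it avoids filtering the whole data frame once per row as the original loop does.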

【Discussion】:

  • Yes, that's an approach I would never have thought of. Thanks!
【Solution 2】:

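I finally understand what you want. My approach is to compute each player's cumulative points at each point in time and then join that back onto the original test_data data frame.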

winner <- c(1,2,3,1,2,3,1,2,3)
loser <-  c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)

library(dplyr)
library(tidyr)

cum_points <- test_data %>% 
  gather(end_game_status, player_id, winner, loser) %>% 
  gather(which_point, how_many_points, winner_points, loser_points) %>% 
  filter(
    (end_game_status == "winner" & which_point == "winner_points") | 
      (end_game_status == "loser" & which_point == "loser_points")) %>% 
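  arrange(date) %>%                                  # date is already a Date column
  group_by(player_id) %>% 
  mutate(cumulative_points = cumsum(how_many_points)) %>% 
  # points earned before the current game: lag the running total within each player
  mutate(cumulative_points_sofar = lag(cumulative_points, default = 0)) %>% 
  ungroup() %>% 
  select(player_id, date, cumulative_points_sofar)   # keep the column the joins below rename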

output <- test_data %>% 
  left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>% 
  rename(winner_points_sum = cumulative_points_sofar) %>% 
  left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>% 
  rename(loser_points_sum = cumulative_points_sofar)
output
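
As a variation on the same "compute per player, then join back" idea, the long table can be built by stacking the winner and loser columns with bind_rows instead of gathering twice and filtering. This is only a sketch under the assumption that a player appears at most once per date; prior_points, player, points and points_sum are illustrative names that do not come from the answer.

```r
library(dplyr)

# Sample data from the question
test_data <- data.frame(
  winner = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  loser  = c(3, 1, 1, 2, 1, 1, 3, 1, 2),
  date   = as.Date("2017-10-01") + 0:8,
  winner_points = c(2, 1, 2, 1, 2, 1, 2, 1, 2),
  loser_points  = c(1, 0, 1, 0, 1, 0, 1, 0, 1)
)

# One row per (date, player) with the points earned in that game
prior_points <- bind_rows(
  select(test_data, date, player = winner, points = winner_points),
  select(test_data, date, player = loser,  points = loser_points)
) %>%
  arrange(date) %>%
  group_by(player) %>%
  # running total minus the current game = points earned before this date
  mutate(points_sum = cumsum(points) - points) %>%
  ungroup() %>%
  select(date, player, points_sum)

# Look up the prior total once for the winner and once for the loser
result <- test_data %>%
  left_join(prior_points, by = c("date", "winner" = "player")) %>%
  rename(winner_points_sum = points_sum) %>%
  left_join(prior_points, by = c("date", "loser" = "player")) %>%
  rename(loser_points_sum = points_sum)

result
```

Renaming after each join keeps the two lookups from colliding on the points_sum column name.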


    【Solution 3】:

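    The difference from the previous question of the OP is that the OP now asks for the cumulative sum of the points each player has earned so far (i.e., before the actual date). In addition, the sample dataset now includes a date column that uniquely identifies each row.

    So my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format, reshaping two value variables simultaneously, computes the cumulative sum for each player id, and finally reshapes from long back to wide format. To sum only the points scored before the actual date, the rows are lagged by one.

    Note that the winner and loser columns contain the respective player IDs.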

    library(data.table)
    cols <- c("winner", "loser")
    setDT(test_data)[
      # reshape multiple value variables simultaneously from wide to long format
      , melt(.SD, id.vars = "date", 
             measure.vars = list(cols, paste0(cols, "_points")), 
             value.name = c("id", "points"))][
               # rename variable column
               , variable := forcats::lvls_revalue(variable, cols)][
                 # order by date and cumulate the lagged points by id
                 order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
                   # reshape multiple value variables simultaneously from long to wide format
                   , dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
    
             date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
    1: 2017-10-01         1        3             2            1                 0                0
    2: 2017-10-02         2        1             1            0                 0                2
    3: 2017-10-03         3        1             2            1                 1                2
    4: 2017-10-04         1        2             1            0                 3                1
    5: 2017-10-05         2        1             2            1                 1                4
    6: 2017-10-06         3        1             1            0                 3                5
    7: 2017-10-07         1        3             2            1                 5                4
    8: 2017-10-08         2        1             1            0                 3                7
    9: 2017-10-09         3        2             2            1                 5                4
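
The "lag by one, then cumulate" step can be illustrated on a single player's points in base R. The vector below is player 1's first five games from the sample data; c(0, head(points, -1)) is used here as a base R stand-in for shift(points, fill = 0), and the variable names are made up for the illustration.

```r
# Points player 1 earned in their first five games, in date order
points <- c(2, 0, 1, 1, 1)

# Plain running total: includes the current game
inclusive <- cumsum(points)                  # 2 2 3 4 5

# Lag by one (pad with 0), then cumulate: only games before the current one
lagged <- cumsum(c(0, head(points, -1)))     # 0 2 2 3 4

# Equivalent formulation used in Solution 1: running total minus current game
subtracted <- cumsum(points) - points        # 0 2 2 3 4

identical(lagged, subtracted)
```

Both formulations give the same exclusive running total, so the choice between them is mostly a matter of taste.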
    
