Scraping lineup data from play-by-play using R (answers)

【Question title】: Scraping lineup data from play by play using R
【Posted】: 2017-01-30 15:18:15
【Question】:

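I'm working with basketball play-by-play data and want to create a "lineup" column containing a list that I will summarise later. Here is some sample data: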

   game_id team_id opp_team_id player_id period secs_remaining  action_type action_subtype
     <int>   <int>       <int>     <int>  <int>          <int>        <chr>          <chr>
1     1475       5           8       587      1            720 substitution             in
2     1475       5           8        66      1            720 substitution             in
3     1475       5           8       596      1            720 substitution             in
4     1475       5           8       206      1            720 substitution             in
5     1475       5           8       469      1            720 substitution             in
6     1475       8           5       940      1            720 substitution             in
7     1475       8           5       120      1            720 substitution             in
8     1475       8           5       124      1            720 substitution             in
9     1475       8           5      1040      1            720 substitution             in
10    1475       8           5       114      1            720 substitution             in
11    1475      NA          NA        NA      1            720         game          start
12    1475       5           8       469      1            719     jumpball            won
13    1475       8           5       114      1            718     jumpball           lost
14    1475       8           5       120      1            695        steal               
15    1475       5           8       469      1            695     turnover   ballhandling

I have been trying to use dplyr's mutate() with lists, but I hit a dead end every time. The expected output would have a new column (using rows 1 to 5 as an example):

   id    lineup
<int>    <list>
    1    <int [5]> --> contains (587, NULL, NULL, NULL, NULL)
    2    <int [5]> --> contains (587, 66, NULL, NULL, NULL)
    3    <int [5]> --> contains (587, 66, 596, NULL, NULL)
    4    <int [5]> --> contains (587, 66, 596, 206, NULL)
    5    <int [5]> --> contains (587, 66, 596, 206, 469)

I know that appending new elements to a list is slow, so if there is a better way to handle this in R, I'm open to any suggestions.

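It is important that it handles combinations, not orderings (i.e., once I summarise it, the vector (1,2,3,4,5) should count as the same lineup as (2,3,4,5,1)).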
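One possible way to make the order irrelevant, sketched here as an assumption rather than something stated in the question, is to sort each lineup before collapsing it into a key. `lineup_key` is a hypothetical helper name introduced for this illustration.

```r
# Hypothetical helper: build an order-independent key for a lineup vector.
# sort() puts the player IDs in a canonical order before they are pasted together.
lineup_key <- function(players) {
  paste(sort(players), collapse = "-")
}

a <- c(1L, 2L, 3L, 4L, 5L)
b <- c(2L, 3L, 4L, 5L, 1L)

# Both orderings map to the same key, so they would fall into the same group
# when the data is summarised by that key.
identical(lineup_key(a), lineup_key(b))
```

Grouping by such a key (for example with group_by() in dplyr) should then treat the two vectors as one lineup.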

Thanks in advance.

Update

Here is an additional example that is not at the start of the game:

  game_id team_id opp_team_id player_id period secs_remaining  action_type action_subtype
    <int>   <int>       <int>     <int>  <int>          <int>        <chr>          <chr>
1    1475       8           5       124      1            369       foulon               
2    1475       5           8       206      1            369 substitution            out
3    1475       5           8       125      1            369 substitution             in
4    1475       8           5      1040      1            369 substitution            out
5    1475       8           5        73      1            369 substitution             in
6    1475       8           5       124      1            358          3pt

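This is the first substitution after the start of that game. Each team's lineup should be:

For team 8: list(940,120,124,1040,114)

For team 5: list(587,66,596,206,46)

Here is the expected output (lineup column only):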

   id lineup
<int> <list>
    1 <int [5]> --> contains(940,120,124,1040,114) #This isn't a substitute
    2 <int [5]> --> (587,66,596,46) #This was the sub out for Team 5
    3 <int [5]> --> (587,66,596,46,125) #This was the sub in for Team 5
    4 <int [5]> --> (940,120,124,114) #This was the sub out for Team 8
    5 <int [5]> --> (940,120,124,114,73) #This was the sub in for Team 8
    6 <int [5]> --> (940,120,124,114,73) #This isn't a substitute

My latest attempt:

dat %>%
#Initialize lineup column
mutate(lineup = NA) %>%
mutate(lineup = ifelse(
          #Check if it's the start of the game
          is.na(lag(game_id)) | lag(game_id) != game_id,
          player_id,
          #Check if it's a substitution
          ifelse(
            action_type == 'substitution',
            #Check if it's a sub in or a sub out
            ifelse(
              #Sub in
              action_subtype == 'in',
              "sub in",
              #Sub out
              "sub out"
            ),
            "not a sub"
          )
        ))
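The nested ifelse() above mixes integer player IDs and strings in one column. A possible alternative, sketched under the assumption that a separate character label per row is wanted (the answer further down reads such labels from a `checker` column), could look like this. `label_rows` is a hypothetical name, not part of the original post.

```r
# Hypothetical helper: label each row as 'start', 'sub in', 'sub out' or 'not a sub'.
label_rows <- function(game_id, action_type, action_subtype) {
  # game_id of the previous row (NA for the very first row)
  prev_game <- c(NA, head(game_id, -1))
  # A row starts a game if there is no previous row or the game_id changes
  is_start <- is.na(prev_game) | prev_game != game_id
  is_sub <- action_type == "substitution"
  ifelse(is_start, "start",
    ifelse(is_sub & action_subtype == "in", "sub in",
      ifelse(is_sub & action_subtype == "out", "sub out", "not a sub")))
}

labels <- label_rows(
  game_id        = c(1475L, 1475L, 1475L, 1476L),
  action_type    = c("substitution", "substitution", "steal", "substitution"),
  action_subtype = c("in", "out", "", "in")
)
labels
```

The result could be stored with something like dat$checker <- label_rows(dat$game_id, dat$action_type, dat$action_subtype), assuming the column names from the sample data.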

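【Comments】:

What are you searching for? What criteria do you need returned? What code have you written to try to get this? (Simply stating that you used mutate() doesn't help us dissect your problem.) Once you provide that information, you are more likely to get an answer.
@Badger Basically, the logic is: if it is the start of a game, the lineup column is the player ID (because the first observation for any game ID is a substitution). If it is not the start of a game, check the action: for a sub in, add the current row's player_id to the previous row's lineup list; for a sub out, remove the current row's player_id from the previous row's lineup list; for anything else, just copy the previous row's lineup list. I have edited the question to show where I currently am with mutate().
What is the point of creating a list rather than just a regular column? Maybe you could make a simpler, more illustrative example? It seems something complicated happens when action_type is not a substitution, but your example output is just 5 consecutive substitutions. A good example would have one or two substitutions (more are unnecessary) followed by some other action types. An even better example would be easy to copy/paste (use dput() to create a copy/pasteable R object).
@Gregor The lineup column is a list so that player_ids can be removed and added on substitutions. I will add another example that is not 5 straight substitutions.
I was able to fill in the lineup column with a for loop over every row, but it takes a long time (because for loops are inefficient), so I would still like to try a mutate() solution. If that is not possible, I may look into the data.table package (although I really like dplyr).

Tags: r dplyr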

【Answer 1】:

I couldn't find a way to do this with mutate(), so I settled for a loop. If anyone is looking, here is the answer:

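library(magrittr)  # provides the %<>% compound assignment pipe

# Expects a `checker` column with the values 'start', 'sub in' or 'sub out';
# any other value leaves both lineups unchanged.
calc_lineup <- function(df) {
  lineup <- setNames(list(NA, NA, NA), c("t1", "t1_lineup", "t2_lineup"))
  # Create the list columns up front so each row can be filled in below
  df$t1_lineup <- vector("list", nrow(df))
  df$t2_lineup <- vector("list", nrow(df))
  for (row in seq_len(nrow(df))) {
    if (df[row, ]$checker == 'start') {
      # Start of the game: this row's team becomes team 1
      lineup$t1 <- df[row, ]$team_id
      lineup$t1_lineup <- df[row, ]$player_id
      lineup$t2_lineup <- NA
    } else if (df[row, ]$checker == 'sub in') {
      # Sub in: append the player, then drop the NA placeholder if present
      if (lineup$t1 == df[row, ]$team_id) {
        lineup$t1_lineup %<>% c(df[row, ]$player_id)
        lineup$t1_lineup <- lineup$t1_lineup[!is.na(lineup$t1_lineup)]
      } else {
        lineup$t2_lineup %<>% c(df[row, ]$player_id)
        lineup$t2_lineup <- lineup$t2_lineup[!is.na(lineup$t2_lineup)]
      }
    } else if (df[row, ]$checker == 'sub out') {
      # Sub out: remove the player from the matching team's lineup
      if (lineup$t1 == df[row, ]$team_id) {
        lineup$t1_lineup <- lineup$t1_lineup[lineup$t1_lineup != df[row, ]$player_id]
      } else {
        lineup$t2_lineup <- lineup$t2_lineup[lineup$t2_lineup != df[row, ]$player_id]
      }
    }
    # Store the current state of both lineups on this row
    df$t1_lineup[[row]] <- lineup$t1_lineup
    df$t2_lineup[[row]] <- lineup$t2_lineup
  }
  return(df)
}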
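The same row-by-row logic can also be written as a fold. The following is only a sketch, not the answerer's method: it uses base R's Reduce() with accumulate = TRUE, keeps one lineup per team in a named list, and assumes the column names from the sample data. `running_lineups`, `init`, `subs` and `start` are hypothetical names introduced here. Reduce() still iterates internally, so any speed-up over the loop is not guaranteed.

```r
# Sketch: track running lineups with Reduce(accumulate = TRUE).
# The state is a named list mapping team_id (as a string) to its current players.
running_lineups <- function(df, init = list()) {
  step <- function(state, i) {
    if (identical(df$action_type[i], "substitution")) {
      team <- as.character(df$team_id[i])
      current <- state[[team]]
      if (df$action_subtype[i] == "in") {
        state[[team]] <- c(current, df$player_id[i])          # sub in: add player
      } else {
        state[[team]] <- current[current != df$player_id[i]]  # sub out: drop player
      }
    }
    state  # any other action leaves the lineups unchanged
  }
  states <- Reduce(step, seq_len(nrow(df)), init = init,
                   accumulate = TRUE, simplify = FALSE)
  states[-1]  # drop the initial state so there is one entry per row
}

# The second example from the question, with the starting lineups given there.
subs <- data.frame(
  team_id        = c(8L, 5L, 5L, 8L, 8L, 8L),
  player_id      = c(124L, 206L, 125L, 1040L, 73L, 124L),
  action_type    = c("foulon", "substitution", "substitution",
                     "substitution", "substitution", "3pt"),
  action_subtype = c("", "out", "in", "out", "in", ""),
  stringsAsFactors = FALSE
)
start <- list("8" = c(940L, 120L, 124L, 1040L, 114L),
              "5" = c(587L, 66L, 596L, 206L, 46L))

subs$lineups <- running_lineups(subs, init = start)
subs$lineups[[3]][["5"]]  # team 5 after its sub in
subs$lineups[[6]][["8"]]  # team 8 on the last row
```

Each element of the lineups column holds both teams' lineups as of that row, so the acting team's lineup can be pulled out by its team_id.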

【Comments】: