【Question title】: Spread a data.frame with repetitive column
【Posted】: 2020-02-22 08:24:57
【Question description】:

I have a large data.frame that I am trying to spread. A toy example is shown below.

data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))

head(data)

  date ticker value
1 2019    SPY     1
2 2020    SPY     2
3 2019   MSFT     3
4 2020   MSFT     4

I want to spread it so that the data.frame looks like this.

spread(data, key = ticker, value = value)
  date MSFT SPY
1 2019    3   1
2 2020    4   2

However, when I do this on my actual data.frame, I get an error.

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882

Below are the head and tail of my data.frame.

head(df)
  ref.date   ticker weeklyReturn
  <date>     <chr>         <dbl>
1 2008-02-01 SPY         NA     
2 2008-02-04 SPY         NA     
3 2008-02-05 SPY         NA     
4 2008-02-06 SPY         NA     
5 2008-02-07 SPY         NA     
6 2008-02-08 SPY         -0.0478

tail(df)
  ref.date   ticker weeklyReturn
  <date>     <chr>         <dbl>
1 2020-02-12 MDYV        0.00293
2 2020-02-13 MDYV        0.00917
3 2020-02-14 MDYV        0.0179 
4 2020-02-18 MDYV        0.0107 
5 2020-02-19 MDYV        0.00422
6 2020-02-20 MDYV        0.00347

【Comments】:

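  • Did you check the rows listed in the error above, at least the first few?
  • Your data has duplicates, so it cannot be reshaped as is. You need to aggregate in some way first, and then spread. By the way, use pivot_wider(names_from = ticker, values_from = value); spread is in a "retired" state. :D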
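Following up on both comments, here is a minimal sketch of how the offending rows could be pulled out before reshaping. It assumes dplyr is installed and uses a small made-up data.frame with one duplicated date-ticker pair, since the asker's real data is not available; on the real data the grouping columns would be ref.date and ticker.

```r
library(dplyr)

# Hypothetical toy data: (2019, SPY) appears twice
data <- data.frame(
  date   = c("2019", "2019", "2020", "2019", "2020"),
  ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
  value  = c(8, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Keep only the rows whose (date, ticker) key occurs more than once
dupes <- data %>%
  group_by(date, ticker) %>%
  filter(n() > 1) %>%
  ungroup()

dupes
```

Looking at these rows usually shows whether the duplicates are exact copies (safe to drop) or genuinely different values that need an aggregation rule.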

Tags: r tidyverse spread


【Solution 1】:

You can use the dplyr and tidyr packages. To get rid of the error, you first have to sum the values within each date-ticker group.

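library(dplyr)
library(tidyr)
# Collapse duplicate date-ticker rows by summing, then reshape to wide
data %>%
  group_by(date, ticker) %>%
  summarise(value = sum(value)) %>%
  pivot_wider(names_from = ticker, values_from = value)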

# date  MSFT  SPY
# <fct> <dbl> <dbl>
#  1 2019  3     1
#  2 2020  4     2
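The same aggregation can probably be folded into the reshape itself through the values_fn argument of pivot_wider(). The sketch below assumes tidyr 1.0.0 or later and uses a made-up data.frame with one duplicated date-ticker pair so the summing is visible; whether sum() is the right rule for weekly returns is a separate question for the asker.

```r
library(tidyr)

# Hypothetical toy data: (2019, SPY) appears twice
data <- data.frame(
  date   = c("2019", "2019", "2020", "2019", "2020"),
  ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
  value  = c(8, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# values_fn is applied to every cell that receives more than one value
wide <- pivot_wider(
  data,
  names_from  = ticker,
  values_from = value,
  values_fn   = list(value = sum)
)

wide
```

The named-list form of values_fn is used here because it is the form mentioned in the pivot_wider() warning; newer tidyr releases also accept a bare function.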

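【Comments】:

    【Solution 2】:

    As said in the comments, you have multiple values for the same date-ticker combination. You need to define what to do with them.
    Here is a reprex: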

    library(tidyr)
    library(dplyr)
    
    # your data is more like:
    data = data.frame(
      date = c(2019, rep(c("2019", "2020"), 2)), 
      ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"), 
      value = c(8, 1, 2, 3, 4))
    
    # With two values for same date-ticker combination
    data
    #>   date ticker value
    #> 1 2019    SPY     8
    #> 2 2019    SPY     1
    #> 3 2020    SPY     2
    #> 4 2019   MSFT     3
    #> 5 2020   MSFT     4
    
    # Results in error
    data %>% 
      spread(ticker, value)
    #> Error: Each row of output must be identified by a unique combination of keys.
    #> Keys are shared for 2 rows:
    #> * 1, 2
    
    # New pivot_wider() Creates list-columns for duplicates
    data %>% 
      pivot_wider(names_from = ticker, values_from = value)
    #> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
    #> * Use `values_fn = list(value = list)` to suppress this warning.
    #> * Use `values_fn = list(value = length)` to identify where the duplicates arise
    #> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
    #> # A tibble: 2 x 3
    #>   date  SPY       MSFT     
    #>   <fct> <list>    <list>   
    #> 1 2019  <dbl [2]> <dbl [1]>
    #> 2 2020  <dbl [1]> <dbl [1]>
    
    # Otherwise, decide yourself how to summarise duplicates with mean() for instance
    data %>% 
      group_by(date, ticker) %>% 
      summarise(value = mean(value, na.rm = TRUE)) %>% 
      spread(ticker, value)
    #> # A tibble: 2 x 3
    #> # Groups:   date [2]
    #>   date   MSFT   SPY
    #>   <fct> <dbl> <dbl>
    #> 1 2019      3   4.5
    #> 2 2020      4   2
    

    Created on 2020-02-22 by the reprex package (v0.3.0)
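The warning in the reprex suggests values_fn = list(value = length) as a way to see where the duplicates arise. A minimal sketch of that hint, assuming tidyr 1.0.0 or later and the same kind of made-up data as in the reprex:

```r
library(tidyr)

# Hypothetical toy data: (2019, SPY) appears twice
data <- data.frame(
  date   = c("2019", "2019", "2020", "2019", "2020"),
  ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
  value  = c(8, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Each cell now holds the number of input rows that map to it
counts <- pivot_wider(
  data,
  names_from  = ticker,
  values_from = value,
  values_fn   = list(value = length)
)

counts
```

Any cell greater than 1 marks a date-ticker combination that needs deduplication or an explicit summary function before the reshape.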

    【Comments】: