【Question title】: Calculating an average for unique value combinations [duplicate]
【Posted】: 2017-09-23 17:44:31
【Question】:

I have a dataset with the following columns:

locID       = the location ID of the observer
yr          = the year of the observation in categorical format: P_year
maxFlock    = a number counted by the observer
lat         = latitude of the location
long        = longitude of the location
state       = US state of the observation
effortDays  = categorical, I, II, III, and IV
effortHours = categorical, A, B, C, D 

Here is a sample of the data frame:

PData

  locID    yr  maxFlock     lat   long state effortDays effortHours 
L4278   P_2000        3   41.42 -73.67    NY        II           C
L4278   P_2000        6   41.42 -73.67    NY       III           C
L4278   P_2000        4   41.42 -73.67    NY       III           C
L4278   P_2012        2   41.42 -73.67    NY       III           B
L4278   P_2012        4   41.42 -73.67    NY        IV           B
L4278   P_2012        8   41.42 -73.67    NY        IV           B
L10494  P_2003        4   42.01 -77.44    NY        IV           C
L10494  P_2003        0   42.01 -77.44    NY        IV           C
L10494  P_2003        8   42.01 -77.44    NY        IV           D
L10494  P_2005        4   42.01 -77.44    NY        IV           C
L10494  P_2005        6   42.01 -77.44    NY        IV           C
L10494  P_2009        8   42.01 -77.44    NY        IV           C

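I would like to create a new column (labeled xmf) that holds the mean of maxFlock. However, the mean must be calculated for each unique combination of locID, yr, effortDays, and effortHours. If I ran the code on the sample above, the end product would look like this: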

PData

  locID     yr maxFlock         xmf        lat    long  state effortDays effortHours 
L4278   P_2000        3           3       41.42 -73.67    NY        II           C
L4278   P_2000        6           5       41.42 -73.67    NY       III           C
L4278   P_2000        4           5       41.42 -73.67    NY       III           C
L4278   P_2012        2           2       41.42 -73.67    NY       III           B
L4278   P_2012        4           6       41.42 -73.67    NY        IV           B
L4278   P_2012        8           6       41.42 -73.67    NY        IV           B
L10494  P_2003        4           2       42.01 -77.44    NY        IV           C
L10494  P_2003        0           2       42.01 -77.44    NY        IV           C
L10494  P_2003        8           8       42.01 -77.44    NY        IV           D
L10494  P_2005        4           5       42.01 -77.44    NY        IV           C
L10494  P_2005        6           5       42.01 -77.44    NY        IV           C
L10494  P_2009        8           8       42.01 -77.44    NY        IV           C

I initially tried:

PData$xmf = ave(myData2$maxFlock, myData2$locID, myData2$yr, myData2$effortDays, myData2$effortHours)

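But it did not work (I had to kill it after waiting more than half an hour), and I am not even sure ave() can do what I want it to do.

I considered a split-apply-combine approach, but I don't think it is quite what I want, because I would have to subset by locID, then by year, and then by either effort hours or effort days, and I don't want to make that choice. I want to do it by unique combination.

A fast way to do this would be great. The data I am working with has about 2.5 million rows, so an if statement inside a for loop is definitely not ideal.

Thanks!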

【Comments】:

  • Maybe use the data.table package, with something like this: library(data.table); setDT(PData); PData[, xmf := ave(maxFlock), by = .(locID, yr, effortDays, effortHours)]. It should be easy to tweak further to get the result you want.
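
The data.table suggestion in the comment above could be sketched roughly as follows. This is only an illustrative sketch, not the commenter's tested code: it assumes the data.table package is installed, and `toy` is a hypothetical stand-in for PData that holds just the columns needed for grouping.

```r
library(data.table)

# Hypothetical stand-in for PData (only the columns needed here)
toy <- data.frame(
  locID       = rep(c("L4278", "L10494"), each = 6),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2012", "P_2012", "P_2012",
                  "P_2003", "P_2003", "P_2003", "P_2005", "P_2005", "P_2009"),
  maxFlock    = c(3, 6, 4, 2, 4, 8, 4, 0, 8, 4, 6, 8),
  effortDays  = c("II", "III", "III", "III", "IV", "IV",
                  "IV", "IV", "IV", "IV", "IV", "IV"),
  effortHours = c("C", "C", "C", "B", "B", "B", "C", "C", "D", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Work on a data.table copy; setDT(toy) would convert in place instead
DT <- as.data.table(toy)

# := adds the column by reference, one group mean per unique combination
DT[, xmf := mean(maxFlock), by = .(locID, yr, effortDays, effortHours)]
```

Rows keep their original order, and every row in a group receives that group's mean, which is the layout the question asks for.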

Tags: r dataframe subset split-apply-combine


【Solution 1】:

A solution using dplyr.

library(dplyr)

PData <- PData %>%
  group_by(locID, yr, effortDays, effortHours) %>%
  mutate(xmf = mean(maxFlock)) %>%
  select(c(1:3, 9, 4:8))
PData
# A tibble: 12 x 9
# Groups:   locID, yr, effortDays, effortHours [8]
    locID     yr maxFlock   xmf   lat   long state effortDays effortHours
    <chr>  <chr>    <int> <dbl> <dbl>  <dbl> <chr>      <chr>       <chr>
 1  L4278 P_2000        3     3 41.42 -73.67    NY         II           C
 2  L4278 P_2000        6     5 41.42 -73.67    NY        III           C
 3  L4278 P_2000        4     5 41.42 -73.67    NY        III           C
 4  L4278 P_2012        2     2 41.42 -73.67    NY        III           B
 5  L4278 P_2012        4     6 41.42 -73.67    NY         IV           B
 6  L4278 P_2012        8     6 41.42 -73.67    NY         IV           B
 7 L10494 P_2003        4     2 42.01 -77.44    NY         IV           C
 8 L10494 P_2003        0     2 42.01 -77.44    NY         IV           C
 9 L10494 P_2003        8     8 42.01 -77.44    NY         IV           D
10 L10494 P_2005        4     5 42.01 -77.44    NY         IV           C
11 L10494 P_2005        6     5 42.01 -77.44    NY         IV           C
12 L10494 P_2009        8     8 42.01 -77.44    NY         IV           C

Data

PData <- read.table(text = "  locID    yr  maxFlock     lat   long state effortDays effortHours 
L4278   P_2000        3   41.42 -73.67    NY        II           C
                 L4278   P_2000        6   41.42 -73.67    NY       III           C
                 L4278   P_2000        4   41.42 -73.67    NY       III           C
                 L4278   P_2012        2   41.42 -73.67    NY       III           B
                 L4278   P_2012        4   41.42 -73.67    NY        IV           B
                 L4278   P_2012        8   41.42 -73.67    NY        IV           B
                 L10494  P_2003        4   42.01 -77.44    NY        IV           C
                 L10494  P_2003        0   42.01 -77.44    NY        IV           C
                 L10494  P_2003        8   42.01 -77.44    NY        IV           D
                 L10494  P_2005        4   42.01 -77.44    NY        IV           C
                 L10494  P_2005        6   42.01 -77.44    NY        IV           C
                 L10494  P_2009        8   42.01 -77.44    NY        IV           C
                 ",
                 header = TRUE, stringsAsFactors = FALSE)

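【Comments】:

  • Thanks! That looks like what I want to do. Could you explain what select(c(1:3, 9, 4:8)) is doing? Thanks!!
  • @Heliornis It is a way of specifying columns by index number. After the mutate call, the xmf column is the last column of the data frame. I then moved xmf to the same position as in your desired output. If the position of xmf does not matter, this step is optional.
  • Oh, got it, thanks!
【Solution 2】: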

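You can create a new grouping key that combines the four columns (locID, yr, effortDays, effortHours). Then call tapply with that key as INDEX, and simply look up each row's value by its key.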

grouping <- paste(PData$locID,
                  PData$yr,
                  PData$effortDays,
                  PData$effortHours, sep = "_")
agg.vals <- tapply(PData$maxFlock, INDEX = grouping, FUN = mean)
PData["xmf"] <- agg.vals[grouping]
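
The same single-key idea can probably rescue the ave() call from the question. As far as I understand, ave() groups by the interaction of all supplied factors, which may create a level for every possible combination, including ones that never occur in the data, and that is a likely cause of the slowdown. Below is a minimal sketch under that assumption; `toy` and `key` are hypothetical names, with `toy` standing in for PData.

```r
# Hypothetical stand-in for PData (only the columns needed here)
toy <- data.frame(
  locID       = rep(c("L4278", "L10494"), each = 6),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2012", "P_2012", "P_2012",
                  "P_2003", "P_2003", "P_2003", "P_2005", "P_2005", "P_2009"),
  maxFlock    = c(3, 6, 4, 2, 4, 8, 4, 0, 8, 4, 6, 8),
  effortDays  = c("II", "III", "III", "III", "IV", "IV",
                  "IV", "IV", "IV", "IV", "IV", "IV"),
  effortHours = c("C", "C", "C", "B", "B", "B", "C", "C", "D", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# One key per row; rows with the same key belong to the same combination
key <- paste(toy$locID, toy$yr, toy$effortDays, toy$effortHours, sep = "_")

# With a single grouping vector, ave() only sees combinations present in the data.
# FUN defaults to mean, and the result is aligned with the original rows.
toy$xmf <- ave(toy$maxFlock, key)
```

One caveat: pasting with "_" could in principle merge two different combinations if the values themselves contain "_" in unlucky places, so a separator that never appears in the data is the safer choice.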

【Comments】:

【Solution 3】:
    df <- aggregate(PData$maxFlock, by = list(PData$locID, PData$yr, PData$effortDays, PData$effortHours), FUN = mean)
    names(df) <- c("locID", "yr", "effortDays", "effortHours", "xmf")
    
    df
    
        locID   yr   effortDays effortHours   xmf
    1   L4278  P_2012     III       B         2
    2   L4278  P_2012      IV       B         6
    3   L4278  P_2000      II       C         3
    4   L4278  P_2000     III       C         5
    5  L10494  P_2003      IV       C         2
    6  L10494  P_2005      IV       C         5
    7  L10494  P_2009      IV       C         8
    8  L10494  P_2003      IV       D         8
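
Note that aggregate returns one row per combination rather than a new column on the original rows. To get the per-row xmf column the question asks for, the summary can be joined back onto the data. Here is a rough sketch, assuming a hypothetical stand-in `toy` for PData; `means` and `merged` are illustrative names as well.

```r
# Hypothetical stand-in for PData (only the columns needed here)
toy <- data.frame(
  locID       = rep(c("L4278", "L10494"), each = 6),
  yr          = c("P_2000", "P_2000", "P_2000", "P_2012", "P_2012", "P_2012",
                  "P_2003", "P_2003", "P_2003", "P_2005", "P_2005", "P_2009"),
  maxFlock    = c(3, 6, 4, 2, 4, 8, 4, 0, 8, 4, 6, 8),
  effortDays  = c("II", "III", "III", "III", "IV", "IV",
                  "IV", "IV", "IV", "IV", "IV", "IV"),
  effortHours = c("C", "C", "C", "B", "B", "B", "C", "C", "D", "C", "C", "C"),
  stringsAsFactors = FALSE
)

keys <- c("locID", "yr", "effortDays", "effortHours")

# One row per observed combination, holding the group mean
means <- aggregate(maxFlock ~ locID + yr + effortDays + effortHours,
                   data = toy, FUN = mean)
names(means)[names(means) == "maxFlock"] <- "xmf"

# Join the group means back onto every original row
merged <- merge(toy, means, by = keys)
```

By default merge sorts the result by the key columns, so the row order will differ from the input. If the original order matters, one option is to add a row index column before merging and sort by it afterwards.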
    

【Comments】:
