【Question title】: How to get conditional weighted means for several columns
【Posted】: 2014-05-08 14:06:56
【Question】:

For the following data frame:

eu <- structure(list(land = structure(c(1L, 4L, 5L, 12L, 9L, 13L, 16L, 18L, 27L, 10L, 25L, 21L, 28L, 19L, 8L, 26L, 6L, 3L, 15L, 14L, 11L, 17L, 20L, 23L, 24L, 2L, 22L, 7L), .Label = c("Belgie", "Bulgarije", "Cyprus", "Denemarken", "Duitsland", "Estland", "Europese Unie", "Finland", "Frankrijk", "Griekenland", "Hongarije", "Ierland", "Italie", "Letland", "Litouwen", "Luxemburg", "Malta", "Nederland", "Oostenrijk", "Polen", "Portugal", "Roemenie", "Slovenie", "Slowakije", "Spanje", "Tsjechie", "Verenigd Koninkrijk", "Zweden"), class = "factor"), `1979` = c(91.36, 47.82, 65.73, 63.61, 60.71, 85.65, 88.91, 58.12, 32.35, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 61.99), `1981` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 81.48, NA, NA, NA, NA, NA, NA, NA,  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1984` = c(92.09, 52.38, 56.76, 47.56, 56.72, 82.47, 88.79, 50.88, 32.57, 80.59, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 58.98), `1987` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 68.52, 72.42, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1989` = c(90.73, 46.17, 62.28, 68.28, 48.8, 81.07, 87.39, 47.48, 36.37, 80.03, 54.71, 51.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 58.41), `1994` = c(90.66, 52.92, 60.02, 43.98, 52.71, 73.6, 88.55, 35.69, 36.43, 73.18, 59.14, 35.54, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 56.67), `1995` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 41.63, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1996` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.73, 57.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1999` = c(91.05, 50.46, 45.19, 50.21, 46.76, 69.76, 87.27, 30.02, 24, 70.25, 63.05, 39.93, 38.84, 49.4, 30.14, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 49.51), `2004` = c(90.81, 47.89, 43, 58.58, 42.76, 71.72, 91.35, 39.26, 38.52, 63.22, 45.14, 38.6, 37.85, 42.43, 39.43, 28.3, 26.83, 72.5, 
48.38, 41.34, 38.5, 82.39, 20.87, 28.35, 16.97, NA, NA, 45.47), `2007` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29.22, 29.47, NA), `2009` = c(90.39, 59.54, 43.3, 58.64, 40.63, 65.05, 90.75, 36.75, 34.7, 52.61, 44.9, 36.78, 45.53, 45.97, 40.3, 28.2, 43.9, 59.4, 20.98, 53.7, 36.31, 78.79, 24.53, 28.33, 19.64, 38.99, 27.67, 43), inwoners = c(11161642, 5602628, 80523746, 4591087, 65578819, 59685227, 537039, 16779575, 63896071, 11062508, 46727890, 10487289, 9555893, 8451860, 5426674, 10516125, 1320174, 865878, 2971905, 2023825, 9908798, 421364, 38533299, 2058821, 5410836, 7284552, 20020074, 501403599), plicht = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ja", "nee"), class = "factor")), .Names = c("land", "1979", "1981", "1984", "1987", "1989", "1994", "1995", "1996", "1999", "2004", "2007", "2009", "inwoners", "plicht"), row.names = c(NA, -28L), class = "data.frame")

I need conditional column means. I can get those like this:

verplicht <- c("Europese Unie (stemplicht)", colMeans(eu[eu$plicht=="ja",c(2:13)], na.rm=TRUE), NA)
vrij <- c("Europese Unie (geen stemplicht)", colMeans(eu[eu$plicht=="nee",c(2:13)], na.rm=TRUE), NA)
eu2 <- rbind(eu, verplicht, vrij)

However, I need weighted column means, with the countries' population (the inwoners column) as weights. I tried:

verplicht <- c("Europese Unie (stemplicht)", lapply(eu[eu$plicht=="ja",c(2:13)], weighted.mean(x, eu[eu$plicht=="ja",14], na.rm=TRUE)), NA)

But that results in the following error:

Error in weighted.mean.default(x, eu[eu$plicht == "ja", 14], na.rm = TRUE) : 
  'x' and 'w' must have the same length

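I understand what the error message means, but I don't know how to solve it. Any suggestions?

【Comments】:

    Tags: r mean weighted-average


    【Solution 1】:

    The problem is how you are using lapply. Here is the correct code: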

    lapply(eu[eu$plicht=='ja',2:13], weighted.mean, eu[eu$plicht=='ja','inwoners'], na.rm=TRUE)
    lapply(eu[eu$plicht=='nee',2:13], weighted.mean, eu[eu$plicht=='nee','inwoners'], na.rm=TRUE)
    

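    Note how weighted.mean is passed as the function argument, rather than being called with x as it would be inside an anonymous function. Equivalently, you could write: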

    lapply(eu[eu$plicht=='ja',2:13], function(x) weighted.mean(x, eu[eu$plicht=='ja','inwoners'], na.rm=TRUE))
    lapply(eu[eu$plicht=='nee',2:13], function(x) weighted.mean(x, eu[eu$plicht=='nee','inwoners'], na.rm=TRUE))
    

    But you are currently mixing these two different ways of using lapply.
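    As a minimal sketch of the first form, here it is on a small made-up data frame (toy and all its values are hypothetical, not the question's data). It uses sapply so that the result is a named numeric vector, then builds a one-row data frame with matching column names before calling rbind.

```r
# Toy data (hypothetical values), shaped like the question's eu data frame
toy <- data.frame(
  land     = c("A", "B", "C", "D"),
  `1999`   = c(90, 50, NA, 70),
  `2004`   = c(80, 40, 60, NA),
  inwoners = c(10, 20, 30, 40),
  plicht   = c("ja", "nee", "ja", "ja"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

ja    <- toy$plicht == "ja"
years <- c("1999", "2004")

# Function plus extra arguments: each column is passed as x, the population
# subset is passed as w. sapply simplifies the result to a named vector.
wm_ja <- sapply(toy[ja, years], weighted.mean, toy[ja, "inwoners"], na.rm = TRUE)

# Build a one-row data frame in the same column order, force the names to
# match, then append it.
new_row <- data.frame(land = "All (ja)", t(wm_ja), inwoners = NA, plicht = NA,
                      stringsAsFactors = FALSE)
names(new_row) <- names(toy)
toy2 <- rbind(toy, new_row)
```

    Keeping the label and the numbers in a data frame rather than in c() avoids coercing the means to character.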

    【Comments】:

    • Thanks! Using verplicht <- c("Europese Unie (stemplicht)", lapply(eu[eu$plicht=='ja',2:13], weighted.mean, eu[eu$plicht=='ja','inwoners'], na.rm=TRUE), NA, NA), I get a list of 15 elements. However, when I try to combine eu and verplicht into a new df with eu2 <- rbind(eu, verplicht), I get: Error in match.names(clabs, nmi) : names do not match previous names. With my first attempt using colMeans, this worked. Any idea how to solve this?
    • @Jaap Try sapply instead of lapply. lapply returns a list, whereas colMeans returns a vector, which is probably what causes the problem.
    • Or, to keep working with lapply: rbind(eu,setNames(as.data.frame(verplicht),colnames(eu)))
    • @Thomas That only partially solves it. The cell that held "Europese Unie (stemplicht)" before now contains NA.
    • @Vivek That also results in NA in the first column.
    【Solution 2】:

    If inwoners is the population, then

    > (weights <- with(eu, inwoners/sum(inwoners)))
    #  [1] 0.0111303968 0.0055869443 0.0802983327 0.0045782350 0.0653952416 
    #  [6] 0.0595181478 0.0005355356 0.0167326033 0.0637172042 0.0110315403 
    # [11] 0.0465970828 0.0104579315 0.0095291428 0.0084282004 0.0054114829
    # [16] 0.0104866868 0.0013164784 0.0008634541 0.0029635856 0.0020181596 
    # [21] 0.0098810599 0.0004201845 0.0384254312 0.0020530577 0.0053956892 
    # [26] 0.0072641601 0.0199640310 0.5000000000
    

    For example, the weighted mean of the 2004 column is

    > weighted.mean(eu$`2004`, w = weights, na.rm = TRUE)
    # [1] 45.31782
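    A side note, sketched on made-up numbers: weighted.mean divides by the sum of the weights itself, so rescaling the weights to sum to 1 beforehand should not change the result, and raw population counts give the same answer. The vectors x and w below are hypothetical.

```r
x <- c(90, 50, NA, 70)   # hypothetical turnout values
w <- c(10, 20, 30, 40)   # hypothetical population counts

raw        <- weighted.mean(x, w, na.rm = TRUE)
normalized <- weighted.mean(x, w / sum(w), na.rm = TRUE)

# Both equal sum(x * w) / sum(w), taken over the non-missing entries of x
by_hand <- sum(x * w, na.rm = TRUE) / sum(w[!is.na(x)])
```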
    

    To get the weighted mean of each year column where plicht == 'ja',

    > s <- subset(eu, plicht == "ja")
    > w2 <- weights[as.numeric(rownames(s))]
    > newDF <- do.call(rbind, lapply(2:13, function(i){
          data.frame(wtMean.ja = weighted.mean(s[,i], w = w2, na.rm = TRUE))
          }))
    > rownames(newDF) <- names(s)[2:13]
    > newDF
    #      wtMean.ja
    # 1979  86.56735
    # 1981  81.48000
    # 1984  83.56127
    # 1987  68.52000
    # 1989  72.30636
    # 1994  69.86950
    # 1995       NaN
    # 1996       NaN
    # 1999  69.28708
    # 2004  63.17060
    # 2007       NaN
    # 2009  58.99465
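    The same kind of table can also be produced for both groups in one pass. The following is only a sketch on a small made-up data frame (toy and its values are hypothetical): split the rows by plicht, then apply weighted.mean to each year column with that group's inwoners as weights. In the question's data the aggregate "Europese Unie" row has plicht == "nee", so it would presumably need to be dropped before doing this; that is an assumption about the intent.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  `1999`   = c(90, 50, NA, 70, 30),
  `2004`   = c(80, 40, 60, NA, 20),
  inwoners = c(10, 20, 30, 40, 60),
  plicht   = c("ja", "nee", "ja", "ja", "nee"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
years <- c("1999", "2004")

# Outer sapply: one column per plicht group.
# Inner sapply: one weighted mean per year column, weighted by that
# group's population.
wt_means <- sapply(split(toy, toy$plicht), function(g) {
  sapply(g[years], weighted.mean, w = g$inwoners, na.rm = TRUE)
})
```

    The result is a matrix with one row per year and one column per group, so wt_means["2004", "ja"] picks out a single value.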
    

    【Comments】:
