【Question title】: Get mean of column values based on another column - R (dplyr)
【Posted】: 2021-06-09 14:47:44
【Question】:

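A simple workflow is as follows:

  • For each entity, get the first 3 non-null values of the 'PROD_OIL' column.
  • Compute the mean of the corresponding values in the 'FORECAST_PROD_OIL' column, ignoring NAs (if any).

Input: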

structure(list(entity= c("A", "A", "A", "A", "A", "A", "A", 
"A"), REPORT_DATE = structure(c(1623110400, 1623024000, 1622937600, 
1622851200, 1622764800, 1622678400, 1622592000, 1622505600), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), PROD_OIL = c("NA", "NA", "265.85000000000002", 
"NA", "272.45999999999998", "NA", "262.32", "NA"), PROD_GAS = c("NA", 
"NA", "2940.78", "NA", "2947.35", "NA", "3237.78", "NA"), FORECAST_PROD_OIL = c(283.71353, 
284.29868, 284.88622, 285.47615, 286.06849, 286.66326, 287.26047, 
287.86013), FORECAST_PROD_GAS = c(3038.99083, 3042.47991, 3045.97701, 
3049.48216, 3052.99539, 3056.51672, 3060.04619, 3063.58382)), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

I wrote this simple dplyr command, but I am not getting the correct mean.

AvgLast3WT <- dt%>%
  dplyr::arrange(entity,desc(REPORT_DATE))%>%
  dplyr::group_by(entity) %>% 
  dplyr::select(entity,REPORT_DATE,PROD_OIL,PROD_GAS,FORECAST_PROD_OIL, FORECAST_PROD_GAS)%>%
  dplyr::summarise(GetMean= mean(na.omit(with(dt, FORECAST_PROD_OIL[!is.na(PROD_OIL)])[1:3])))%>%
  ungroup()

The answer should be 286.07 (the mean of the cells highlighted in red in the original screenshot), but I get 285.4!

【Discussion】:

    Tags: r dplyr


    【Answer 1】:

    This should work:

    df %>% 
      filter(!is.na(PROD_OIL)) %>% 
      group_by(entity) %>% 
      head(3) %>% 
      summarise(Mean=mean(FORECAST_PROD_OIL, na.rm=TRUE))
    

    and gives a value of 284.2995. However, your sample data does not contain the values that your image implies:

    df %>% 
      filter(!is.na(PROD_OIL)) %>% 
      group_by(entity) %>% 
      head(3) %>% 
      pull(FORECAST_PROD_OIL)
    [1] 283.7135 284.2987 284.8862
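
The three values above are simply the first three rows of the data. In the posted `dput()` output, `PROD_OIL` is a character column holding the text "NA", so `!is.na(PROD_OIL)` keeps every row. A minimal sketch of that behaviour, assuming dplyr is installed; the tibble is a cut-down copy of the posted columns and `df_clean` is a name introduced here for illustration:

```r
library(dplyr)

# Cut-down copy of the posted data: PROD_OIL is character, with "NA" stored as text.
df <- tibble(
  entity = "A",
  PROD_OIL = c("NA", "NA", "265.85000000000002", "NA",
               "272.45999999999998", "NA", "262.32", "NA"),
  FORECAST_PROD_OIL = c(283.71353, 284.29868, 284.88622, 285.47615,
                        286.06849, 286.66326, 287.26047, 287.86013)
)

# is.na() only flags true missing values, so the text "NA" is not counted.
sum(is.na(df$PROD_OIL))        # 0

# na_if() converts the text "NA" to a real NA; as.numeric() parses the rest.
df_clean <- df %>%
  mutate(PROD_OIL = as.numeric(na_if(PROD_OIL, "NA")))

sum(is.na(df_clean$PROD_OIL))  # 5
```

After a conversion like this, `filter(!is.na(PROD_OIL))` drops the five placeholder rows as intended.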
    


      【Answer 2】:
      df %>%
        filter(PROD_OIL != "NA") %>%
        group_by(entity) %>%
        top_n(3) %>%
        summarise(Mean = mean(FORECAST_PROD_OIL)) %>%
        as.data.frame()
      

      gives:

      Selecting by FORECAST_PROD_GAS
        entity     Mean
      1      A 286.0717
      

      In the structure you provided, the NAs are strings rather than actual NA values. If they are real NA values in your df, replace PROD_OIL != "NA" with !is.na(PROD_OIL).
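
One caveat: without a `wt` argument, `top_n(3)` ranks rows by the last column (hence the "Selecting by FORECAST_PROD_GAS" message), so it matches "the first 3 rows" here only because exactly three rows survive the filter. A hedged alternative sketch, assuming dplyr 1.0.0 or later for `slice_head()`; the tibble reproduces only the posted columns needed, and `result` is an illustrative name:

```r
library(dplyr)

# Relevant columns of the posted data; "NA" in PROD_OIL is text, as in the dput().
df <- tibble(
  entity = "A",
  REPORT_DATE = as.POSIXct(
    c(1623110400, 1623024000, 1622937600, 1622851200,
      1622764800, 1622678400, 1622592000, 1622505600),
    origin = "1970-01-01", tz = "UTC"
  ),
  PROD_OIL = c("NA", "NA", "265.85000000000002", "NA",
               "272.45999999999998", "NA", "262.32", "NA"),
  FORECAST_PROD_OIL = c(283.71353, 284.29868, 284.88622, 285.47615,
                        286.06849, 286.66326, 287.26047, 287.86013)
)

result <- df %>%
  # Turn the text "NA" into a real NA, then parse the numbers.
  mutate(PROD_OIL = as.numeric(na_if(PROD_OIL, "NA"))) %>%
  filter(!is.na(PROD_OIL)) %>%
  # Most recent report first, so "first 3" means the 3 latest non-missing rows.
  arrange(entity, desc(REPORT_DATE)) %>%
  group_by(entity) %>%
  # Take the first 3 rows of each entity by position, not by value.
  slice_head(n = 3) %>%
  summarise(GetMean = mean(FORECAST_PROD_OIL, na.rm = TRUE), .groups = "drop")

result
```

On the posted data this averages 284.88622, 286.06849 and 287.26047, which is about 286.07. Note that the columns are referenced directly inside `summarise()`; wrapping them in `with(dt, ...)`, as in the question, evaluates against the whole ungrouped `dt` rather than the current group.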
