【Question title】: Compute mean across variable using R
【Posted】: 2020-01-10 19:28:37
【Question】:

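I am having trouble creating a dataset that holds the mean, the median, and the 25th and 75th percentiles of value1 for each level of a variable (in my case the variable is crisis_t in dataset df1). The code I tried is below. The problem is that the percentiles are not computed correctly, and I don't understand why. Any ideas?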

#what I have
country <- c("AT","AT","AT","AT","BE","BE","BE","BE","DE","DE","DE")
crisis_t  <- c(-1,0,1,2,-1,0,1,2,0,1,2)
value1  <- c(0.01,0.02,0.015,0.03,0.5,0.55,0.7,0.4,0.01,0.02,0.04)

df1 <- data.frame(country, crisis_t,value1)

#what I would like to obtain

crisis_t <- c(-1,0,1,2)
mean_t   <- c(0.255,0.193,0.245,0.156)
median_t <- c(0.255,0.02,0.02,0.04)
perc_25  <- c(NA,0.01,0.015,0.03)
perc_75  <- c(NA,0.55,0.7,0.4)

df2 <- data.frame(crisis_t, mean_t, median_t, perc_25, perc_75)

#my code does not compute correctly the 25th quantile
library(data.table)
df1 <- as.data.table(df1)
df2_try <- data.table()
df2_try <- df1[,mean_t2:=mean(value1, na.rm=TRUE),by=.(crisis_t)]
df2_try <- df1[,median_t2:=median(value1, na.rm=TRUE),by=.(crisis_t)]
df2_try <- df1[,perc_25:=quantile(value1, probs=0.25),by=.(crisis_t)]
df2_try <- df1[,perc_75:=quantile(value1, probs=0.75),by=.(crisis_t)]

df2_try

Thanks for your help.

Edit: the actual dataset.

country       <- c("AT","AT","AT","AT","BE","BE","BE","BE","BE","BE","BE","DE","DE","DE")
crisis_AT_1   <- c(-1,0,1,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
crisis_BE_1   <- c(NA,NA,NA,NA,-1,0,1,2,3,4,5,6,NA,NA)
crisis_BE_2   <- c(NA,NA,NA,NA,-4,-3,-2,-1,0,1,2,-2,NA,NA)
crisis_DE_1   <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,0)
value1        <- c(0.01,0.02,0.015,0.03,0.5,0.55,0.7,0.4,0.01,0.02,0.04,0.02,0.14,0.21)

df3 <- data.frame(country, crisis_AT_1,crisis_BE_1,crisis_BE_2,crisis_DE_1,value1)

【Comments】:

    Tags: r data.table


    【Solution 1】:

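    By default, the quantile function uses a continuous definition of the quantile (type 7). This means that when no observation falls exactly at the requested probability, it interpolates between neighbouring observations of the empirical distribution to estimate the value that should be there.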
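To make the interpolation concrete, here is a small sketch using the three values observed at crisis_t == 0 in the question's df1. It relies only on base R; the vector name x is just for illustration.

```r
# Values of value1 where crisis_t == 0 in the question's df1
x <- c(0.02, 0.55, 0.01)

# Default quantile (type = 7) interpolates between sorted observations:
# sorted x is 0.01, 0.02, 0.55, so the 25% point falls halfway between
# 0.01 and 0.02, and the 75% point halfway between 0.02 and 0.55.
q_default <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
print(q_default)  # 0.015 and 0.285
```

Neither number is an actual observation, which is why the result differs from the 0.01 and 0.55 expected in the question.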

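    Based on your expected output, you seem to need quantile type 2, which takes the quantile from the discrete empirical distribution and averages the two neighbouring values at a discontinuity. You can use it as follows: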

    library(data.table)
    df1 <- as.data.table(df1)
    df2_try <- copy(df1)
    df2_try[,mean_t2:=  mean(value1),by=.(crisis_t)]
    df2_try[,median_t2:=quantile(value1, 0.50, type=2),by=.(crisis_t)]
    df2_try[,perc_25:=  quantile(value1, 0.25, type=2),by=.(crisis_t)]
    df2_try[,perc_75:=  quantile(value1, 0.75, type=2),by=.(crisis_t)]
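The := calls above attach the statistics to every row of the table. If the goal is one row per crisis_t, as in the df2 sketched in the question, a grouped summary is one possible variant. This is a minimal sketch rather than part of the original answer; names = FALSE only drops the percentage labels that quantile attaches to its result.

```r
library(data.table)

# Sample data from the question
df1 <- data.table(
  country  = c("AT","AT","AT","AT","BE","BE","BE","BE","DE","DE","DE"),
  crisis_t = c(-1, 0, 1, 2, -1, 0, 1, 2, 0, 1, 2),
  value1   = c(0.01, 0.02, 0.015, 0.03, 0.5, 0.55, 0.7, 0.4, 0.01, 0.02, 0.04)
)

# One row per crisis_t; keyby also sorts the result by crisis_t
df2_try <- df1[, .(
  mean_t   = mean(value1, na.rm = TRUE),
  median_t = quantile(value1, 0.50, type = 2, names = FALSE),
  perc_25  = quantile(value1, 0.25, type = 2, names = FALSE),
  perc_75  = quantile(value1, 0.75, type = 2, names = FALSE)
), keyby = crisis_t]

print(df2_try)
```

For crisis_t of 0, 1 and 2 this reproduces the perc_25 and perc_75 values listed in the question; for crisis_t == -1 it returns numbers instead of NA.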
    

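    However, this will not return NA the way you want, because the minimum sits at quantile 0 and the maximum at quantile 1, so the 25% and 75% quantiles always have a value associated with them. Still, if you really need it, you can force that behaviour with ifelse.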
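One way the ifelse approach could look is sketched below. The answer does not say which condition should produce NA, so the rule used here (fewer than min_n observations in the group) and the name min_n are assumptions chosen to match the expected output in the question.

```r
library(data.table)

df1 <- data.table(
  crisis_t = c(-1, 0, 1, 2, -1, 0, 1, 2, 0, 1, 2),
  value1   = c(0.01, 0.02, 0.015, 0.03, 0.5, 0.55, 0.7, 0.4, 0.01, 0.02, 0.04)
)

# Assumption: groups with fewer than min_n observations get NA
min_n <- 3

df2_try <- copy(df1)
df2_try[, perc_25 := ifelse(.N < min_n, NA_real_,
                            quantile(value1, 0.25, type = 2)),
        by = .(crisis_t)]
df2_try[, perc_75 := ifelse(.N < min_n, NA_real_,
                            quantile(value1, 0.75, type = 2)),
        by = .(crisis_t)]

print(df2_try)
```

Here .N is the number of rows in the current group, so crisis_t == -1 (two observations) ends up with NA while the other groups keep their type 2 quantiles.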

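    By the way, you don't need to reassign df2_try after every modification. In data.table, the := operations you are performing happen in place (they modify the object itself), so you can write them as I did in the example. I used the copy function from data.table so that the original data.table df1 and the modified version df2_try are separate objects.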
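The difference between plain assignment and copy can be seen in a toy example. The table and column names below (dt, alias, separate, y, z) are made up for illustration.

```r
library(data.table)

dt <- data.table(x = 1:3)

# Plain assignment does not copy: both names point to the same table
alias <- dt
alias[, y := x * 2]
print("y" %in% names(dt))   # TRUE, dt was modified through alias

# copy() creates an independent table
separate <- copy(dt)
separate[, z := 0]
print("z" %in% names(dt))   # FALSE, dt is untouched
```

This is also why the question's code ends up modifying df1 itself: each `df2_try <- df1[, ... := ...]` line updates df1 in place and then binds df2_try to that same object.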

    【Discussion】:

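    • I am having some difficulty because the crises differ for each country, so I edited the question. Would it be possible to obtain the same result (the mean across crisis periods)?
    • I assume you need perc_25, mean, median and perc_75 for each country's crisis_* groups. With that in mind, what you need is to gather (tidyverse) or melt (data.table) the crisis columns and then do the same thing, but grouping by the country, crisis type and crisis value columns.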
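The melt suggestion in the last comment could be sketched as follows, using the df3 from the question's edit. The column names crisis and crisis_t for the reshaped data are assumptions, and so is the final grouping: the sketch groups by crisis_t only, which pools all countries and crises for each relative period. Add country and crisis to `by` to follow the comment literally.

```r
library(data.table)

# Actual dataset from the question's edit
df3 <- data.table(
  country     = c("AT","AT","AT","AT","BE","BE","BE","BE","BE","BE","BE","DE","DE","DE"),
  crisis_AT_1 = c(-1, 0, 1, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  crisis_BE_1 = c(NA, NA, NA, NA, -1, 0, 1, 2, 3, 4, 5, 6, NA, NA),
  crisis_BE_2 = c(NA, NA, NA, NA, -4, -3, -2, -1, 0, 1, 2, -2, NA, NA),
  crisis_DE_1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -1, 0),
  value1      = c(0.01, 0.02, 0.015, 0.03, 0.5, 0.55, 0.7, 0.4,
                  0.01, 0.02, 0.04, 0.02, 0.14, 0.21)
)

# Wide to long: one row per (observation, crisis) pair, dropping NA periods
long <- melt(
  df3,
  id.vars       = c("country", "value1"),
  measure.vars  = grep("^crisis_", names(df3), value = TRUE),
  variable.name = "crisis",
  value.name    = "crisis_t",
  na.rm         = TRUE
)

# Same statistics as before, per relative crisis period
stats <- long[, .(
  mean_t   = mean(value1),
  median_t = quantile(value1, 0.50, type = 2, names = FALSE),
  perc_25  = quantile(value1, 0.25, type = 2, names = FALSE),
  perc_75  = quantile(value1, 0.75, type = 2, names = FALSE)
), keyby = crisis_t]

print(stats)
```

Because an observation can belong to more than one crisis (the BE rows carry both crisis_BE_1 and crisis_BE_2), it appears once per crisis in the long table and can contribute to several crisis_t groups.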