Calculating medians per year per ID in R and plotting the outcome

【Question title】: Calculating medians per year per ID in R and plotting the outcome
【Posted】: 2020-10-23 02:01:41
【Question】:

Dataset:

structure(list(ID = c(1234, 1234, 1234, 1234, 1234, 1234, 1234, 
1234, 8769, 8769, 8769, 8769, 8769, 7457, 7457, 7457, 7457, 7457, 
7457, 55667, 55667, 55667, 55667, 55667, 55667, 55667, 3789, 
3789, 3789, 3789, 3789, 3789), date_of_bloods = structure(c(978307200, 
981072000, 1173052800, 1175731200, 1367798400, 1465171200, 1467936000, 
1659916800, 1072915200, 1075680000, 1173052800, 1175731200, 1367798400, 
978307200, 981072000, 1173052800, 1175731200, 1367798400, 1465171200, 
978307200, 981072000, 1173052800, 1270425600, 1273104000, 1465171200, 
1467936000, 1270425600, 1367798400, 1465171200, 1465257600, 1465344000, 
1465430400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    result = c(90, 80, 60, 40, 25, 22, 22, 21, 70, 65, 43, 23, 
    22, 90, 90, 88, 86, 76, 74, 58, 46, 35, 34, 33, 30, 24, 76, 
    67, 56, 34, 33, 23), `mutation type` = c(1, 1, 1, 1, 1, 1, 
    1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 
    3, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -32L), class = "data.frame")

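I would like the median result per ID per year, with the years expressed as 0, 1, 2, 3, etc. to keep the groups consistent. I then want to plot these as lines with some indication of their mutation type.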

What I have done so far:

filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)

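The actual years don't matter; for consistency I would like them to be 0, 1, 2, 3, 4, etc. for each ID. I don't know how to do that. I would then like to plot this data (the yearly median blood result per ID) with some indication of which mutation type each ID belongs to.

I hope this isn't too broad a question.

Many thanks.

【Comments】:

Tags: r dataframe lm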

【Solution 1】:

You can try this (filtered is the dput() you included). I hope this helps:

library(dplyr)
library(lubridate)
library(ggplot2)
library(broom)
#Data
filtered %>% mutate(year=year(date_of_bloods)) %>%
group_by(ID,year,`mutation type`) %>% summarise(med=median(result)) -> df1
#Variables
df1 %>% ungroup()%>% mutate(ID=as.factor(ID),
                            year=as.factor(year),
                            `mutation type`=as.factor(`mutation type`)) -> df1
#Plot
ggplot(df1,aes(x=ID,y=med,fill=`mutation type`,color=year,group=year))+
  geom_line()

For the models:

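#Models
# year is a factor in df1, so convert it back to numeric to get one slope per ID
#Coefs
# fit one lm per ID and collect each coefficient table with broom::tidy()
dfCoef <- df1 %>%
  mutate(year = as.numeric(as.character(year))) %>%
  group_by(ID) %>%
  group_modify(~ tidy(lm(med ~ year, data = .x)))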


# A tibble: 10 x 6
# Groups:   ID [5]
      ID term        estimate std.error statistic p.value
   <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1  1234 (Intercept)  6329.    1546.         4.09  0.0264
 2  1234 year           -3.13     0.769     -4.07  0.0268
 3  3789 (Intercept) 14318.    4746.         3.02  0.204 
 4  3789 year           -7.08     2.36      -3.00  0.205 
 5  7457 (Intercept)  2409.     403.         5.98  0.0269
 6  7457 year           -1.16     0.201     -5.78  0.0287
 7  8769 (Intercept)  9268.    4803.         1.93  0.304 
 8  8769 year           -4.60     2.39      -1.92  0.306 
 9 55667 (Intercept)  3294.     759.         4.34  0.0492
10 55667 year           -1.62     0.378     -4.29  0.0503
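
If only the slope per ID is needed, a base R version may be enough. The following is a minimal sketch, not the answer author's code: `toy` is a hypothetical data frame holding the yearly medians of two IDs (values copied from the aggregate() output in this thread), and the model regresses the median on the year, so the `year` coefficient is the slope.

```r
# Hypothetical stand-in for the per-ID yearly medians (two IDs only)
toy <- data.frame(
  ID   = c(1234, 1234, 1234, 1234, 1234, 7457, 7457, 7457, 7457),
  year = c(2001, 2007, 2013, 2016, 2022, 2001, 2007, 2013, 2016),
  med  = c(85, 50, 25, 22, 21, 90, 87, 76, 74)
)

# Split by ID, fit med ~ year within each piece, keep only the year coefficient
slopes <- sapply(split(toy, toy$ID), function(d) {
  coef(lm(med ~ year, data = d))[["year"]]
})

round(slopes, 2)
```

The rounded slopes should agree with the `year` estimates for IDs 1234 and 7457 in the table above.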

Code for the desired plot:

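#Plot 2
#Data modifications
# year is a factor, so as.numeric(year) returns its level index (1, 2, ...);
# subtracting 1 gives 0, 1, 2, ... over the distinct years across all IDs
df1 %>% mutate(year2=as.numeric(year)-1) -> df2
df2 %>% mutate(year2=factor(year2,levels = sort(unique(year2)))) -> df2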
#Plot 2
ggplot(df2,aes(x=year2,y=med,color=ID,group=ID))+
  facet_wrap(.~`mutation type`)+
  geom_line()
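
If the index should restart at 0 for each ID rather than follow the distinct years across all IDs, the years can be ranked within each ID. This is a minimal sketch under that assumption: `toy`, `mutation_type` and `year_idx` are hypothetical names, and `toy` contains only a few rows shaped like df1.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for df1 with numeric years
toy <- data.frame(
  ID            = c(1234, 1234, 1234, 8769, 8769),
  year          = c(2001, 2007, 2013, 2004, 2007),
  mutation_type = c(1, 1, 1, 2, 2),
  med           = c(85, 50, 25, 67.5, 33)
)

# Rank the years inside each ID so that every ID starts at 0
toy_idx <- toy %>%
  group_by(ID) %>%
  mutate(year_idx = dense_rank(year) - 1) %>%
  ungroup()

# One line per ID, coloured by mutation type
p <- ggplot(toy_idx, aes(x = year_idx, y = med, group = ID,
                         colour = factor(mutation_type))) +
  geom_line() +
  labs(x = "Year index within ID", y = "Median result",
       colour = "Mutation type")
```

Here ID 8769 starts at 0 even though its first sample is from 2004. If elapsed calendar years are wanted instead of a rank, `year - min(year)` inside the same `mutate()` would be the alternative.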

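【Comments】:

Thank you for taking the time to answer! Everything up to generating df1 does exactly what I need. However, I would then like to turn the years into 0, 1, 2, 3, etc. from the earliest year to the most recent, and create a plot where x is the converted year (0 to n), the y axis is the median for that year, and each line is a different ID. If the lines could be coded by mutation type, that would be perfect.
@tacrolimus Hi. I have added new code for the plot you want. I hope this helps.
Dear @Duck, thanks again for your help. The df part again works perfectly. However, my real dataset has more than 200 IDs, and I get a plot with many of the IDs as separate plots. If I simply want to plot x=year2 and y=median per ID, how would I do that? Also, does the plot automatically generate a key (legend) that hides the actual main plot?
@tacrolimus You can try facet_wrap(.~ID), or remove the wrap and color=ID and group=ID
@tacrolimus I mean keep color=ID and group=ID; you can play around with it. I believe the hard part is done. Let me know if I can help.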
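
For the case with more than 200 IDs raised in the comments, one option is to keep every ID in a single panel and suppress the legend so that it does not crowd out the plot. This is only a sketch with simulated, hypothetical data (`many` is not the asker's dataset); `theme(legend.position = "none")` is the standard ggplot2 way to drop the legend.

```r
library(ggplot2)

# Simulated, hypothetical data: 200 IDs with 5 year indices each
set.seed(1)
many <- data.frame(
  ID    = factor(rep(1:200, each = 5)),
  year2 = rep(0:4, times = 200),
  med   = 90 - rep(0:4, times = 200) * 10 + rnorm(1000, sd = 5)
)

# One semi-transparent line per ID in a single panel, without the ID legend
p <- ggplot(many, aes(x = year2, y = med, group = ID, colour = ID)) +
  geom_line(alpha = 0.4) +
  theme(legend.position = "none")
```

Keeping `group = ID` is what draws a separate line for each ID; the colour mapping could be switched to the mutation type if a short legend is wanted.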

【Solution 2】:

Your naming is unclear; if the data you provided is called df, then you can do this:

df$year <-format(df$date_of_bloods,format="%Y")
aggregate(result ~ year + ID, data = df, median)

   year    ID result
1  2001  1234   85.0
2  2007  1234   50.0
3  2013  1234   25.0
4  2016  1234   22.0
5  2022  1234   21.0
6  2010  3789   76.0
7  2013  3789   67.0
8  2016  3789   33.5
9  2001  7457   90.0
10 2007  7457   87.0
11 2013  7457   76.0
12 2016  7457   74.0
13 2004  8769   67.5
14 2007  8769   33.0
15 2013  8769   22.0
16 2001 55667   52.0
17 2007 55667   35.0
18 2010 55667   33.5
19 2016 55667   27.0
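
To turn these years into 0, 1, 2, ... within each ID using base R only, `ave()` can rank the years per ID. This is a sketch on a hypothetical subset of the output above (`meds` and `year_idx` are illustrative names).

```r
# Hypothetical subset of the aggregate() output shown above
meds <- data.frame(
  year   = c("2001", "2007", "2013", "2004", "2007"),
  ID     = c(1234, 1234, 1234, 8769, 8769),
  result = c(85, 50, 25, 67.5, 33)
)

# format() produced character years, so convert them before ranking
meds$year <- as.integer(meds$year)

# ave() applies the function within each ID; rank() - 1 starts each ID at 0
meds$year_idx <- ave(meds$year, meds$ID, FUN = function(y) rank(y) - 1)
```

Because aggregate() returns one row per year and ID, there are no ties within an ID and the ranks are whole numbers.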

【Comments】: