【Question title】: R, ggplot, separate mean by range of x value
【Posted】: 2014-02-27 06:13:57
【Question】:

I have a data set that looks like this:

  CHROM   POS GT DIFF
1 chr01 14653 CT 254
2 chr01 14907 AG 254
3 chr01 14930 AG 23
4 chr01 15190 GA 260
5 chr01 15211 TG 21
6 chr01 16378 TC 1167

Here POS ranges from 1xxxx to 1xxxxxxx, and CHROM is a categorical variable with the values "chr01" through "chr22" plus "chrX".

I want to draw a scatter plot:

  • y (DIFF) against x (POS)
  • panels separated by CHROM
  • points grouped by GT (a different color for each GT)

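I am building a ggplot with a running average (although this is not time-series data).

What I want is the mean of DIFF, by GT, for every 1,000,000-wide range of POS.

For example:

For x in the range 1 ~ 1,000,000, mean DIFF = _____

For x in the range 1,000,001 ~ 2,000,000, mean DIFF = _____

I want to draw these means as horizontal lines on the ggplot (colored by GT).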
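
A minimal sketch of that per-range mean in base R. The `toy` data frame, its values, and the `POSbin` column are made up for illustration; the real data is only assumed to have the CHROM, POS, GT and DIFF columns shown above.

```r
# Hypothetical toy data in the same shape as the table above
toy <- data.frame(
  CHROM = "chr01",
  POS   = c(14653, 999999, 1000000, 1000001, 1500000, 2000000),
  GT    = c("CT", "CT", "CT", "AG", "AG", "AG"),
  DIFF  = c(254, 100, 300, 23, 27, 50)
)

# Bin index: 0 for POS 1..1,000,000, 1 for POS 1,000,001..2,000,000, ...
toy$POSbin <- (toy$POS - 1) %/% 1e6

# Mean DIFF for each CHROM x bin x GT combination that occurs in the data
bin_means <- aggregate(DIFF ~ CHROM + POSbin + GT, data = toy, FUN = mean)
print(bin_means)
```

Subtracting 1 before the integer division keeps POS = 1,000,000 in the first range, matching the "1 ~ 1,000,000" wording; `floor(POS / 1e6)` would place it in the second range instead.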

#

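What I had before applying your function:

After applying your function:

I tried applying your solution to what I already had, and ran into a few problems:

  • There are several panels, so the means should differ from panel to panel, but when I apply your code the horizontal mean lines in every panel are the same as in the first panel.
  • The panels have different x-axis ranges, so when I apply your function it automatically fills the extra range with the preceding horizontal mean lines.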

Here is my earlier code:

library(ggplot2)
library(scales)  # trans_breaks(), pretty_breaks()
library(grid)    # unit()

ggplot(data1, aes(x = POS, y = DIFF, colour = GT)) +
  geom_point() +
  facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
  theme(strip.text.x = element_text(size = 40),
        strip.background = element_rect(color = "lightblue", fill = "lightblue"),
        legend.position = "top",
        legend.title = element_text(size = 40, colour = "darkblue"),
        legend.text = element_text(size = 40),
        legend.key.size = unit(2.5, "cm")) +
  # GT is mapped to colour, so the legend is configured through `colour`
  guides(colour = guide_legend(title.position = "top",
                               title = "Legend:GT='REF'+'ALT'",
                               override.aes = list(size = 10))) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x, n = 10)) +
  scale_x_continuous(breaks = pretty_breaks(n = 3))

【Question comments】:

    Tags: r ggplot2 mean


    【Solution 1】:

    This should get you started:

    # It saves a lot of headaches to just make factors as you need them
    options(stringsAsFactors = FALSE)
    
    
    
    library(ggplot2)
    library(plyr)
    
    # Here's some made-up data - it always helps if you can post a subset of
    # your real data, though. The dput() function is really useful for that.
    dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
    
    
    # Add random GT value
    dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
                     size = nrow(dat),
                     replace = TRUE)
    
    # Group by millions - there are several ways to do this that I can 
    # never remember, but here's a simple way to split by millions
    dat$POSgroup <- floor(dat$POS / 1e6)
    
    
    # Add an arbitrary DIFF value
    dat$DIFF <- rnorm(n = nrow(dat),
                      mean = 200 * dat$POSgroup,
                      sd = 300)
    
    
    
    # Aggregate the data by GT and POS-group
    # Ideally, you'd do this inside of the plot using stat_summary,
    # but I couldn't get that to work. Using two datasets in a plot 
    # is okay, though.
    datsum <- ddply(dat, .variables = "POSgroup", .fun = function(x) {
    
        # Calculate the mean DIFF value for each GT group in this POSgroup
        meandiff <- ddply(x, .variables = "GT", .fun = summarise, ymean = mean(DIFF))
                    
        # Add the center of the POSgroup range as the x position
        meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
    
        # Return the results
        meandiff
    
    })
    
    
    # On the plot, these results will be grouped by both POS and GT - but
    # ggplot will only accept one vector for grouping. So make a combination.
    datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
    
    
    # Plot it
    ggplot() +
    
        # First, a layer for the points themselves
        # Large numbers of points can get pretty slow - you might try getting
        # the plot to work with a subsample (~1000) and then add in the rest of
        # your data
        geom_point(data = dat, 
                   aes(x = POS, y = DIFF, color = as.factor(GT))) +
    
        # Then another layer for the means. There are a variety of geoms you could
        # use here, but crossbar with ymin and ymax set to the group mean
        # is a simple one
        geom_crossbar(data = datsum, aes(x = center, 
                                         y = ymean, 
                                         ymin = ymean, 
                                         ymax = ymean, 
                                         color = as.factor(GT),
                                         group = combogroup),
                      size = 1) +
    
    
        # Some other niceties
        scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
        labs(x = "POS", y = "DIFF", color = "GT") +
        theme_bw()
    

    The result looks like this:

    【Comments】:

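    • Hi, this is very close to what I want. Could you read my edited question and give me some more hints? If you want the raw data, give me your email address and I can send it to you. Thanks!
    • I don't have time to expand my answer right now, but the basic gist is to modify datsum so that it calculates a separate mean for each combination of CHROM, POSgroup, and @987654326@. I think you should then be able to facet my plot by CHROM to get what you want. Good luck!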
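
One way to follow that suggestion, sketched under several assumptions: the data below is made up (two chromosomes of different lengths), GT is taken as the third grouping variable because the question asks for means by GT, base R `aggregate()` stands in for `ddply()`, and `geom_segment()` stands in for `geom_crossbar()`. The `xmin`/`xmax` columns are hypothetical helpers; clipping `xmax` to each chromosome's largest POS is one possible way to keep the lines from running past the data in panels with a shorter x range.

```r
# Made-up data with the same columns as the question (CHROM, POS, GT, DIFF)
set.seed(1)
dat <- rbind(
  data.frame(CHROM = "chr01", POS = seq(1, 3e6, by = 1e4)),
  data.frame(CHROM = "chr02", POS = seq(1, 2e6, by = 1e4))
)
dat$GT   <- sample(c("CT", "AG", "GA"), nrow(dat), replace = TRUE)
dat$DIFF <- rnorm(nrow(dat),
                  mean = ifelse(dat$CHROM == "chr01", 200, 800),
                  sd = 50)

# Group POS by millions, as in the answer
dat$POSgroup <- floor(dat$POS / 1e6)

# One mean per CHROM x POSgroup x GT, so every panel gets its own lines
datsum <- aggregate(DIFF ~ CHROM + POSgroup + GT, data = dat, FUN = mean)
names(datsum)[names(datsum) == "DIFF"] <- "ymean"

# Horizontal extent of each line: the bin itself, clipped to the largest
# POS observed on that chromosome
chrom_max   <- tapply(dat$POS, dat$CHROM, max)
datsum$xmin <- datsum$POSgroup * 1e6
datsum$xmax <- pmin((datsum$POSgroup + 1) * 1e6,
                    as.vector(chrom_max[as.character(datsum$CHROM)]))

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot() +
    geom_point(data = dat,
               aes(x = POS, y = DIFF, colour = GT), alpha = 0.4) +
    geom_segment(data = datsum,
                 aes(x = xmin, xend = xmax, y = ymean, yend = ymean,
                     colour = GT),
                 linewidth = 1) +
    facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
    theme_bw()
  # print(p) draws the figure
}
```

Because `datsum` carries a CHROM column, `facet_grid()` sends each mean line to its own panel instead of repeating the first panel's lines everywhere.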