【Question title】: Use ddply() to aggregate relative histogram counts
【Posted】: 2013-01-31 10:10:04
【Question】:

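Related to a question I asked earlier (ggplot2 how to get 2 histograms with the y value = to count of one / sum of the count of both), I tried to write a function that takes as input a data.frame containing the response times (RT) and accuracy (correct) of several participants in several conditions, and outputs a "summary" data.frame in which the data are aggregated like a histogram. The particularity here is that I do not want the absolute number of responses in each bin, but a relative count.

By relative count I mean that, for each bin of the histogram, the value corresponds to: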

relative_correct   = ncorrect / sum(ncorrect+nincorrect)
relative_incorrect = nincorrect / sum(ncorrect+nincorrect)

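The result is actually close to a density plot, except that it is not each curve that sums to 1, but the correct and incorrect curves taken together.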
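To make the definition concrete, here is a minimal sketch with made-up counts for three bins (the numbers are illustrative assumptions, not taken from the example data). Both curves are divided by the same grand total, so they sum to 1 together rather than separately:

```r
# Hypothetical per-bin counts for three bins (illustration only)
ncorrect   <- c(2, 5, 3)
nincorrect <- c(1, 0, 1)

# One shared denominator: all responses, correct and incorrect
n_total <- sum(ncorrect + nincorrect)

relative_correct   <- ncorrect / n_total
relative_incorrect <- nincorrect / n_total

# The two curves together sum to 1
total_area <- sum(relative_correct) + sum(relative_incorrect)
print(total_area)
```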

Here is the code to create example data:

# CREATE EXAMPLE DATA
subjectname <- factor(rep(c("obs1","obs2"),each=50))
Visibility  <- factor(rep(rep(c("cond1","cond2"),each=25),2)) 
RT          <- rnorm(100,300,50)
correct     <- sample(c(rep(0,25),rep(1,75)),100)
my.data <- data.frame(subjectname,Visibility,RT,correct)

First I need to define a function that will be used later in ddply:

histRTcounts <- function(df) {
  # Count the RTs falling in each 10 ms bin, without drawing the histogram
  hist(df$RT, breaks = seq(5, 800, by = 10), plot = FALSE)$counts
}

Then the main function (two small problems prevent it from working inside a function, see the lines marked with ?????, but outside a function this code works).

relative_hist_count <- function(df, myfactors) {
  require(ggplot2)
  require(plyr)
  require(reshape2)

  # ddply it to get one column for each bin of the histogram
  myhistRTcounts <- ddply(df, c(myfactors,"correct"), histRTcounts)

  # transform it in long format
  myhistRTcounts.long = melt(myhistRTcounts, id.vars =c(myfactors,"correct"), variable.name="bin", value.name = 'mycount')

  # rename the bin names with the ms value they correspond to
  levels(myhistRTcounts.long$bin) <- seq(5, 800, by=10)[-1]-5

  # make them numeric and not a factor anymore
  myhistRTcounts.long$bin = as.numeric(levels(myhistRTcounts.long$bin))[myhistRTcounts.long$bin]

  # cast to have count_correct and count_incorrect as columns
  # ??????????????????????? problem when putting that into a function
  # Here I was not able to figure out how to combine myfactors to the other variables in the call
  myhistRTcount.short = dcast(myhistRTcounts.long, subjectname + Visibility + bin ~ correct)
  names(myhistRTcount.short)[4:5] <- c("countinc","countcor")

  # compute relative counts
  myhistRTcounts.rel <- ddply(myhistRTcount.short, myfactors, transform, 
                          incorrect = countinc / sum(countinc+countcor),
                          correct = countcor / sum(countinc+countcor)
  )
  myhistRTcounts.rel = subset(myhistRTcounts.rel,select=c(-countinc,-countcor))

  myhistRTcounts.rel.long = melt(myhistRTcounts.rel, id.vars = c(myfactors,"bin"), variable.name = 'correct', value.name = 'mycount')

  # ??????????????????????? idem here, problem when putting that into a function to call myfactors
  ggplot(data=myhistRTcounts.rel.long, aes(x=bin, y=mycount, color=factor(correct))) + geom_line() + facet_grid(Visibility ~ subjectname) + xlim(0, 600) + theme_bw()

  return(myhistRTcounts.rel.long)
}

The call that applies it to the data:

new.df = relative_hist_count(my.data, myfactors = c("subjectname","Visibility"))

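So first, I need your help to make this work as a function, with the possibility of using the myfactors variable in dcast() and ggplot().

But more importantly, I am almost sure that this function could be written in a more elegant and direct way, with fewer steps.

Thanks in advance for your help!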

【Comments】:

    Tags: r ggplot2 histogram plyr reshape2


    【Answer 1】:

    Maybe this helps to set up the data?

    countfun <- function(x,...) {
      res <- hist(x,plot=FALSE,...)
      data.frame(counts=res$counts,
                 break1=res$breaks[-length(res$breaks)],
                 break2=res$breaks[-1])
    }
    
    library(plyr)
    plot.dat <- ddply(my.data,.(Visibility),function(df){
      res <- ddply(df,.(correct),function(df2) {countfun(df2$RT,breaks=seq(100, 600, by=10))})
      res$freq2 <- res$counts/nrow(df)
      res
    })
    

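    You probably need the whole parse / eval / as.formula machinery to generalize this to arbitrary factors. I don't have time for that right now.

    However, if you intend to generalize, it would be better to modify the hist function so that it accepts an argument to use as the factor for the counts.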
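As a rough illustration of the as.formula idea, the sketch below builds the casting and faceting formulas from a character vector of factor names using base R only. The names `cast_formula`, `facet_formula`, and the toy data frame `long` are assumptions made for this example; the `dcast()` call runs only if reshape2 is installed.

```r
myfactors <- c("subjectname", "Visibility")

# Casting formula: grouping factors plus the bin on the left, correct on the right
lhs <- paste(c(myfactors, "bin"), collapse = " + ")
cast_formula <- as.formula(paste(lhs, "~ correct"))
print(cast_formula)

# Faceting formula in rows ~ columns form, as used by facet_grid()
facet_formula <- reformulate(myfactors[1], response = myfactors[2])
print(facet_formula)

# Optional: use the formula with reshape2::dcast() on a toy long table
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- data.frame(
    subjectname = "obs1",
    Visibility  = "cond1",
    bin         = c(10, 10, 20, 20),
    correct     = c(0, 1, 0, 1),
    mycount     = c(1, 4, 2, 3)
  )
  wide <- reshape2::dcast(long, cast_formula, value.var = "mycount")
  print(wide)
}
```

Inside a function, `facet_grid(facet_formula)` could then replace the hard-coded `Visibility ~ subjectname`, and the plot would need an explicit `print()` to be drawn, since it is not the function's return value.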

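    【Comments】:

    • Thanks for your answer, Roland. I wrote a new "hist" function that does what I want. I will post the solution as soon as stackexchange allows it (it seems that new users like me have to wait 8 hours before posting an answer).
    【Answer 2】:

    Thanks to Roland; I had not thought of writing a home-made hist function. Please find it below: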

    RelativeHistRT <- function (df, breaks = seq(5,800,10)) 
    {
      distrib.correct   = hist(df$RT[df$correct==1], breaks, right=FALSE, plot=FALSE)
      distrib.incorrect = hist(df$RT[df$correct==0], breaks, right=FALSE, plot=FALSE)
    
      n.total = sum(distrib.correct$counts) + sum(distrib.incorrect$counts)
    
      data.frame(bin_mids  = distrib.correct$mids,
             correct   = distrib.correct$counts / n.total,
             incorrect = distrib.incorrect$counts / n.total)
    }
    

    Applying it to my original data.frame gives what I was looking for:

    myhistRTcounts <- ddply(my.data, .(subjectname,Visibility), RelativeHistRT)
    

    This is indeed much shorter, and it does exactly what I wanted.
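As a sanity check on the definition, here is a base R sketch (no plyr) that applies the same function per group with `split()` and `lapply()` and confirms that the correct and incorrect curves together sum to 1 within each group. The seed and the names `pieces`, `rel`, and `group_totals` are assumptions added for this example.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Example data as in the question
subjectname <- factor(rep(c("obs1", "obs2"), each = 50))
Visibility  <- factor(rep(rep(c("cond1", "cond2"), each = 25), 2))
RT          <- rnorm(100, 300, 50)
correct     <- sample(c(rep(0, 25), rep(1, 75)), 100)
my.data     <- data.frame(subjectname, Visibility, RT, correct)

# Same function as in the answer above
RelativeHistRT <- function(df, breaks = seq(5, 800, 10)) {
  distrib.correct   <- hist(df$RT[df$correct == 1], breaks, right = FALSE, plot = FALSE)
  distrib.incorrect <- hist(df$RT[df$correct == 0], breaks, right = FALSE, plot = FALSE)

  # Shared denominator: all responses in this group
  n.total <- sum(distrib.correct$counts) + sum(distrib.incorrect$counts)

  data.frame(bin_mids  = distrib.correct$mids,
             correct   = distrib.correct$counts / n.total,
             incorrect = distrib.incorrect$counts / n.total)
}

# Split by the grouping factors and apply the function to each piece
myfactors <- c("subjectname", "Visibility")
pieces    <- split(my.data, my.data[myfactors], drop = TRUE)
rel       <- lapply(pieces, RelativeHistRT)

# Within each group, the two curves together should sum to 1
group_totals <- sapply(rel, function(d) sum(d$correct) + sum(d$incorrect))
print(group_totals)
```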

    【Comments】:

    • Why don't you mark your answer as "accepted" to show clearly that the problem has been solved?