R ggplot histogram given multiple inputs: answers

【Question title】: R ggplot histogram given multiple inputs
【Posted】: 2012-08-10 15:50:05
【Question】:

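I've run into a problem in R and I'm hoping someone can work out why it happens and how to fix it. I am not well versed in R, and I sometimes get confused because one line of code can often do a lot more than it would in many other languages. The problem seems to be that the program does not take in the file input properly after the first file. If I input one file, the histogram comes out just the way I expect. Unfortunately, when multiple files are input, it groups them all together and puts them beside the first file. I would rather each input file had its own separate histogram. Sorry about the long post, but I have tried to give as much information as possible to make my code reproducible (I seem to be bad at making code reproducible).

The code is as follows: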

library("tcltk")
#choose any number of files
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1))
Num.Files<-NROW(File.names)
#read the tables
dat <- lapply(File.names,read.table,header = TRUE)
names(dat) <- paste("f", 1:length(Num.Files), sep="")
#use the 14th columns data
tmp <- stack(lapply(dat,function(x) x[,14]))
#this is where the histogram is made(with percent shown on the y axis)
require(ggplot2)
ggplot(tmp,aes(x = values)) + 
    facet_wrap(~ind) +
    geom_histogram(aes(y=..count../sum(..count..)))
dput(tmp)
dput(dat)
sessionInfo()

Here is an example of a file the user could choose:

Targ  cov  av_cov  87A_cvg  87Ag  87Agr  87Agr  87A_gra  87A%_1   87A%_3   87A%_5   87A%_10  87A%_20  87A%_30 87A%_40   87A%_50 87A%_75 87A%_100
1:028 400   0.42    400 0.42    1   1   2   41.8    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1:296 400   0.42    400 0.42    1   1   2   41.8    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1:453 1646  8.11    1646    8.11    7   8   13  100.0   100.0   87.2    32.0    0.0 0.0 0.0 0.0 0.0 0.0
1:427 1646  8.11    1646    8.11    7   8   13  100.0   100.0   87.2    32.0    0.0 0.0 0.0 0.0 0.0 0.0
1:736 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    0.0 0.0
1:514 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    0.0 0.0
1:296 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    0.0 0.0
1:534 400   0.42    400 0.42    1   1   2   41.8    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

And another:

Targ  cov  av_cov  87A_cvg  87Ag  87Agr  87Agr  87A_gra  87A%_1   87A%_3   87A%_5   87A%_10  87A%_20  87A%_30 87A%_40   87A%_50 87A%_75 87A%_100
    1:028 400   0.42    400 0.42    1   1   2   41.8    0.0 1.0 0.0 20.0    0.0 0.0 0.0 0.0 0.0
    1:296 400   0.42    400 0.42    1   1   2   41.8    0.0 20.0    0.0 40.0    0.0 100.0   10.0    50.0    4.0
    1:453 1646  8.11    1646    8.11    7   8   13  100.0   100.0   87.2    32.0    0.0 100.0   4.0 60.0    30.0    20.0
    1:427 1646  8.11    1646    8.11    7   8   13  100.0   100.0   87.2    32.0    0.0 80.0    40.0    60.0    80.0    90.0
    1:736 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    30.0    20.0
    1:514 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    20.0    30.0
    1:296 5105  29.68   5105    29.68   14  29  48  100.0   100.0   100.0   86.0    65.7    49.4    35.5    16.9    20.0    30.0
    1:534 400   0.42    400 0.42    1   1   2   41.8    0.0 40.0    30.0    80.0    70.0    40.0    30.0    30.0    10.0

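The code works with one file (these histograms were made from different input files, but you get the picture) but not with multiple files, however many there are.

One file: [image]

This is how I would like all the histograms to look, one per input file. Alas...

Multiple files: [image]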

> dput(tmp)
structure(list(values = c(0, 0, 0, 0, 49.4, 49.4, 49.4, 0), ind = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "f1", class = "factor")), .Names = c("values", 
"ind"), row.names = c(NA, -8L), class = "data.frame")
> dput(dat)
structure(list(f1 = structure(list(Targ = structure(c(1L, 2L, 
4L, 3L, 7L, 5L, 2L, 6L), .Label = c("1:028", "1:296", "1:427", 
"1:453", "1:514", "1:534", "1:736"), class = "factor"), cov = c(400L, 
400L, 1646L, 1646L, 5105L, 5105L, 5105L, 400L), av_cov = c(0.42, 
0.42, 8.11, 8.11, 29.68, 29.68, 29.68, 0.42), "X87A_cvg", "X87Ag", "X87Agr", "X87Agr.1", "X87A_gra", "X87A._1", "X87A._3", "X87A._5", "X87A._10", "X87A._20", "X87A._30", "X87A._40", 
"X87A._50", "X87A._75", "X87A._100"), class = "data.frame", row.names = c(NA, 
-8L))), .Names = "f1")
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
    locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods  
[8] base     
    other attached packages:
[1] ggplot2_0.9.1
    loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       grid_2.14.1       
 [5] labeling_0.1       MASS_7.3-17        memoise_0.1        munsell_0.3       
 [9] plyr_1.7.1         proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1    
[13] scales_0.2.1       stringr_0.6

Is there a way to keep each histogram separate, so that each one stands on its own? Thanks in advance, Stephen

【Comments】:

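Do you just want different scales on the y axis of the histograms? Is that what you are after? I mean, you are plotting all the files in a single plot because you stacked all the data and passed all of it to ggplot to draw. If you do want separate scales, that can be done, but if you want separate plots, why wrap them up like one faceted data set? You don't need faceting to get multiple grid-based plots on a single device.
@GavinSimpson Well, to be honest, the reason it is a faceted data set is that I am new to R. Originally I was using the grid package and histogram to make separate histograms, but while I was trying to learn how to take multiple file inputs from the user, someone suggested I change to this, which I know very little about. Hence why I am here. Sorry if this question is tedious for you. I was just hoping for some guidance. Thanks for looking, though.
Your dput(dat) outputs something that results in a corrupted data frame on my Linux machine. Are you sure this is reproducible? Given your comment reply, I say in all seriousness that I suggest you learn to walk before you try to run. You are dabbling in some fairly advanced concepts here. Start simple: don't stack the data, and use base graphics to plot. You will have far less trouble that way. Once you understand what is going on and have learned a bit of ggplot, you will feel more comfortable with these more advanced constructs and higher-level plotting packages.
@GavinSimpson Sadly, I thought that might be the case. I have been battling with it in vain, as it is holding up a day's work. Thank you for your help, though.
See if my new answer helps.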

Tags: r histogram ggplot2

【Solution 1】:

Given that your dput(dat) output gives a corrupted data frame for dat on my system, here is a simpler approach using base R with dummy data.

## fake a list of data frames, here, 4, each with two columns
dat <- list(file1 = data.frame(X = runif(20), Y = rnorm(20)),
            file2 = data.frame(X = runif(20), Y = runif(20)),
            file3 = data.frame(X = runif(20),
                               Y = rnorm(20) + rnorm(20, mean = 2, sd = 2)),
            file4 = data.frame(X = runif(20), Y = rnorm(20, mean = 4)))

## extract the second column from each
## (this is the same as your code extracting the 14th column)
tmp <- lapply(dat, `[[`, 2)

Now look at what we have:

R> str(tmp)
List of 4
 $ file1: num [1:20] -1.0225 -0.0302 -0.0987 1.977 0.2579 ...
 $ file2: num [1:20] 0.84583 0.49525 0.12287 0.43929 0.00132 ...
 $ file3: num [1:20] 2.03 5.27 1.57 2.72 1.12 ...
 $ file4: num [1:20] 4.54 4.08 4.28 4.48 6.36 ...

So try plotting the first component of tmp:

hist(tmp[[1]])

OK, so that works. Now we know we can plot all the components. Here are a couple of ways to do it:

layout(matrix(1:4, ncol = 2))
for(p in seq_along(tmp)) {
    hist(tmp[[p]])
}
layout(1)

Or use lapply() to do the looping for us:

layout(matrix(1:4, ncol = 2))
## invisible() around lapply() keeps the returned list from being printed
invisible(lapply(tmp, hist))
layout(1)

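Both produce something like this: [image]

Obviously we could customise the plots further with axis labels and titles, but I leave that as an exercise for the reader.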
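One way to customise the axis labels and titles, offered as a minimal sketch rather than the answerer's own code: size the plotting grid from the number of list components instead of hard-coding four, and use each component's name as the panel title. The dummy data, the seed, and the axis label text are assumptions made purely for illustration.

```r
## Dummy data standing in for the list of per-file columns (an assumption,
## mirroring the structure of tmp above).
set.seed(1)
tmp <- list(file1 = rnorm(20),
            file2 = runif(20),
            file3 = rnorm(20, mean = 2, sd = 2),
            file4 = rnorm(20, mean = 4))

## n2mfrow() suggests a rows-by-columns grid for a given number of plots.
grid_dim <- grDevices::n2mfrow(length(tmp))
op <- par(mfrow = grid_dim)

## One histogram per component, titled with the component's name.
for (nm in names(tmp)) {
    hist(tmp[[nm]], main = nm, xlab = "Value", ylab = "Frequency")
}

par(op)  # restore the previous graphics settings
```

Because the grid is computed from length(tmp), the same loop should cope with however many files are chosen, as long as the list elements carry names.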

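【Comments】:

This is fantastic, especially because it is so easy to understand. I really appreciate it. I will take your advice and try to walk before I run. Thanks again!

【Solution 2】:

This happens because you are using facet_wrap(). If you want one plot per input, you have to create a loop: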

library("tcltk")
#choose any number of files
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1))
Num.Files<-NROW(File.names)
#read the tables
dat <- lapply(File.names,read.table,header = TRUE)
names(dat) <- paste("f", 1:length(Num.Files), sep="")
#use the 14th columns data
tmp <- stack(lapply(dat,function(x) x[,14]))
#this is where the histogram is made(with percent shown on the y axis)
gHist <- function(df){
   require(ggplot2)
   # New page so it doesn't overplot previous graphs
   grid.newpage()
   ggplot(df,aes(x = values)) + 
      geom_histogram(aes(y=..count../sum(..count..)))+
      # Add a title
      opts(title = unique(df$ind))
}
# split() gives a list of data frames, one per level of ind
# Then lapply will cycle through the list and
# apply the function to each piece
lapply(split(tmp, tmp$ind), gHist)

You only provided data for one plot, so I only made one. R complains that the dput(dat) output is corrupted.

【Comments】:
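A possible contributor to seeing only one label, offered as a reading of the posted code rather than something the thread confirms: Num.Files holds a single number, so length(Num.Files) is always 1 and only one name is generated, whatever the number of files. The sketch below uses two small made-up data frames (hypothetical stand-ins for the chosen files) to show the difference when the names are built from seq_along(dat) instead.

```r
## Two made-up data frames standing in for the files read by read.table()
dat <- list(data.frame(a = 1:3, b = c(0, 10, 20)),
            data.frame(a = 4:6, b = c(30, 40, 50)))
Num.Files <- length(dat)          # a single number: 2

## As posted: length(Num.Files) is 1, so only one name is generated
bad_names <- paste("f", 1:length(Num.Files), sep = "")

## Alternative: one name per list element
good_names <- paste("f", seq_along(dat), sep = "")

## With one name per element, stack() labels every row by its source
names(dat) <- good_names
tmp <- stack(lapply(dat, function(x) x[, 2]))

print(bad_names)
print(good_names)
print(table(tmp$ind))
```

With distinct names in place, the ind column distinguishes the files, which is what both facet_wrap(~ind) and split(tmp, tmp$ind) rely on.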

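I added another file just in case (as I mentioned, I am bad at making sure my code is reproducible). When I implement the code you provided, only one histogram is produced. Would you mind explaining the opts and lapply lines for me?
@Stephopolis If you want to make sure your code is reproducible, open a new R session and run the code you just posted. That way you will know.
@Iselzer Ah, of course. For some reason the most logical solution is never the first one I think of, but experience is changing that. Thank you very much.
Hmm, this doesn't seem to help. It just makes blank pages and then the same histogram. Could this be because I am running R from the command line on a Unix machine?
@Stephopolis On reproducible examples: stackoverflow.com/questions/5963269/… . GitHub's gist.github.com may be of immediate benefit.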
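On the blank pages: one plausible explanation, not confirmed in the thread, is that a ggplot object is only drawn when it is printed, so calling grid.newpage() inside the function opens pages before anything has been drawn. Below is a minimal sketch under that assumption, written for a current ggplot2 (3.3.0 or later, where after_stat() and ggtitle() take the place of ..count.. and opts()). The made-up data, the bin count, the helper name make_hist, and the output file names are all illustrative. Saving each plot with ggsave() also avoids depending on an on-screen device when R is run from the command line.

```r
library(ggplot2)

## Made-up stacked data in the same shape as tmp (values + ind)
set.seed(42)
tmp <- data.frame(values = c(runif(30, 0, 100), runif(30, 0, 50)),
                  ind    = factor(rep(c("f1", "f2"), each = 30)))

## Build one histogram (proportions on the y axis) from one file's rows
make_hist <- function(df) {
    ggplot(df, aes(x = values)) +
        geom_histogram(aes(y = after_stat(count / sum(count))), bins = 10) +
        ylab("Proportion") +
        ggtitle(as.character(unique(df$ind)))
}

## split() gives one data frame per level of ind;
## lapply() then builds one plot object per data frame
plots <- lapply(split(tmp, tmp$ind), make_hist)

## Write each plot to its own PDF file in a temporary directory
out_files <- file.path(tempdir(), paste0("hist_", names(plots), ".pdf"))
for (i in seq_along(plots)) {
    ggsave(out_files[i], plot = plots[[i]], width = 5, height = 4)
}
```

In an interactive session, calling print(plots[[i]]) inside the loop instead of ggsave() should draw each histogram on its own page.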