R cor sometimes returns NaN

【Question title】: R cor returns NaN sometimes
【Posted】: 2014-12-31 23:31:57
【Question】:

I have been working with some data, available here: Dropbox' csv file (please be kind enough to use it to reproduce the error).

When I run this code:

t<-read.csv("120.csv")
x<-NULL
for (i in 1:100){
  x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))
}
sum(is.nan(x))

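I get a random value for the last expression, usually somewhere around 55 to 60. I expect cor to give reproducible results, so x should be a vector of length 100 made up of identical values. For example, here is the output of two independent runs: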

> x<-NULL; for (i in 1:100){x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))}
> sum(is.nan(x))
[1] 62
> head(x,10)
 [1]       NaN       NaN 0.2967441       NaN 0.2967441       NaN       NaN       NaN
 [9] 0.2967441       NaN
> x<-NULL; for (i in 1:100){x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))}
> sum(is.nan(x))
[1] 52
> head(x,10)
 [1] 0.2967441       NaN       NaN       NaN       NaN 0.2967441 0.2967441       NaN
 [9] 0.2967441 0.2967441
>
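The same repeatability check can be written more compactly. This is only a sketch: 120.csv is not reproduced here, so the data frame `d` is synthetic (an assumption for illustration), with missing values injected into both columns. On a correctly working installation every repetition should return the same value.

```r
# Synthetic stand-in for 120.csv (hypothetical data, not the original file)
set.seed(1)
d <- data.frame(nitrate = rnorm(200), sulfate = rnorm(200))
d$nitrate[sample(200, 40)] <- NA   # inject missing values
d$sulfate[sample(200, 40)] <- NA

# Repeat the identical call 100 times; cor() has no random component,
# so all 100 results should be the same number
x <- replicate(100, cor(d$nitrate, d$sulfate, use = "na.or.complete"))

n_nan      <- sum(is.nan(x))     # expected to be 0 on a healthy system
n_distinct <- length(unique(x))  # expected to be 1
```

If `n_nan` is greater than zero or `n_distinct` is greater than one, the installation shows the same inconsistency described in the question.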

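I would like to know what I am doing wrong here, or whether this is a known (or unknown) bug. If it is a bug, I would appreciate it if someone more proficient than me could help me report it to CRAN.

I read a very old (2001) post in which cor.test showed the same behaviour (see "cor.test produces NaN sometimes").

Thanks for your kind explanations, since I am fairly new to R. Thanks!

Following Ben's suggestion: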

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252    LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
[5] LC_TIME=Spanish_Colombia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_0.6.2     digest_0.6.4      RCurl_1.95-4.3    bitops_1.0-6      qpcR_1.4-0        Matrix_1.1-4      robustbase_0.91-1 rgl_0.95.1157    
 [9] minpack.lm_1.1-8  MASS_7.3-35       plyr_1.8.1        swirl_2.2.16      ggplot2_1.0.0     lattice_0.20-29  

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 DEoptimR_1.0-2   grid_3.1.1       gtable_0.1.2     httr_0.5         labeling_0.3     munsell_0.4.2    proto_0.3-10     Rcpp_0.11.3     
[10] reshape2_1.4     scales_0.2.4     testthat_0.9.1   tools_3.1.1      yaml_2.1.13

Result of find("cor"):

> find("cor")
[1] "package:stats"
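A slightly stricter version of this check is sketched below. Besides listing where `cor` is found, it compares the visible function with `stats::cor` directly, which would expose a masked or redefined `cor`. The variable names are only illustrative.

```r
# Where on the search path is an object called "cor" found?
where <- find("cor")

# Is the cor() that a plain call would use really the one from stats?
same_function <- identical(cor, stats::cor)

# Which namespace does the visible cor() belong to?
ns <- environmentName(environment(cor))
```

In a clean session `where` should contain only "package:stats", `same_function` should be TRUE, and `ns` should be "stats".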

---------- ### Second edit ### --------

I restarted the session (I don't really know how to pass the --vanilla argument; I'm using RStudio), and this is the new sessionInfo:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252    LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
[5] LC_TIME=Spanish_Colombia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.1

I ran the commands again in the new session and still got sum(is.nan(x)) = 52 :(

In case it is useful:

> cor
function (x, y = NULL, use = "everything", method = c("pearson", 
    "kendall", "spearman")) 
{
    na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", 
        "everything", "na.or.complete"))
    if (is.na(na.method)) 
        stop("invalid 'use' argument")
    method <- match.arg(method)
    if (is.data.frame(y)) 
        y <- as.matrix(y)
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    if (!is.matrix(x) && is.null(y)) 
        stop("supply both 'x' and 'y' or a matrix-like 'x'")
    if (!(is.numeric(x) || is.logical(x))) 
        stop("'x' must be numeric")
    stopifnot(is.atomic(x))
    if (!is.null(y)) {
        if (!(is.numeric(y) || is.logical(y))) 
            stop("'y' must be numeric")
        stopifnot(is.atomic(y))
    }
    Rank <- function(u) {
        if (length(u) == 0L) 
            u
        else if (is.matrix(u)) {
            if (nrow(u) > 1L) 
                apply(u, 2L, rank, na.last = "keep")
            else row(u)
        }
        else rank(u, na.last = "keep")
    }
    if (method == "pearson") 
        .Call(C_cor, x, y, na.method, FALSE)
    else if (na.method %in% c(2L, 5L)) {
        if (is.null(y)) {
            .Call(C_cor, Rank(na.omit(x)), NULL, na.method, method == 
                "kendall")
        }
        else {
            nas <- attr(na.omit(cbind(x, y)), "na.action")
            dropNA <- function(x, nas) {
                if (length(nas)) {
                  if (is.matrix(x)) 
                    x[-nas, , drop = FALSE]
                  else x[-nas]
                }
                else x
            }
            .Call(C_cor, Rank(dropNA(x, nas)), Rank(dropNA(y, 
                nas)), na.method, method == "kendall")
        }
    }
    else if (na.method != 3L) {
        x <- Rank(x)
        if (!is.null(y)) 
            y <- Rank(y)
        .Call(C_cor, x, y, na.method, method == "kendall")
    }
    else {
        if (is.null(y)) {
            ncy <- ncx <- ncol(x)
            if (ncx == 0) 
                stop("'x' is empty")
            r <- matrix(0, nrow = ncx, ncol = ncy)
            for (i in seq_len(ncx)) {
                for (j in seq_len(i)) {
                  x2 <- x[, i]
                  y2 <- x[, j]
                  ok <- complete.cases(x2, y2)
                  x2 <- rank(x2[ok])
                  y2 <- rank(y2[ok])
                  r[i, j] <- if (any(ok)) 
                    .Call(C_cor, x2, y2, 1L, method == "kendall")
                  else NA
                }
            }
            r <- r + t(r) - diag(diag(r))
            rownames(r) <- colnames(x)
            colnames(r) <- colnames(x)
            r
        }
        else {
            if (length(x) == 0L || length(y) == 0L) 
                stop("both 'x' and 'y' must be non-empty")
            matrix_result <- is.matrix(x) || is.matrix(y)
            if (!is.matrix(x)) 
                x <- matrix(x, ncol = 1L)
            if (!is.matrix(y)) 
                y <- matrix(y, ncol = 1L)
            ncx <- ncol(x)
            ncy <- ncol(y)
            r <- matrix(0, nrow = ncx, ncol = ncy)
            for (i in seq_len(ncx)) {
                for (j in seq_len(ncy)) {
                  x2 <- x[, i]
                  y2 <- y[, j]
                  ok <- complete.cases(x2, y2)
                  x2 <- rank(x2[ok])
                  y2 <- rank(y2[ok])
                  r[i, j] <- if (any(ok)) 
                    .Call(C_cor, x2, y2, 1L, method == "kendall")
                  else NA
                }
            }
            rownames(r) <- colnames(x)
            colnames(r) <- colnames(y)
            if (matrix_result) 
                r
            else drop(r)
        }
    }
}
<bytecode: 0x0000000008ce0158>
<environment: namespace:stats>

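Thanks again.

【Comments】:

I would appreciate it if someone could add the tag "cor". My reputation (still under 1500) does not allow me to create new tags, and I think it is essential for people facing the same issue. Thanks!
FWIW, it looks like the old problem was fixed in R 1.4.0 (!): cran.r-project.org/src/base/NEWS.1 says "cor(*, use = "all.obs") <= 1 is now guaranteed which ensures that sqrt(1 - r^2) is always ok in cor.test(). (PR#1099)"
I can't reproduce this; I always get sum(is.nan(x)) equal to zero. (1) Try starting from a clean R session (with --vanilla if possible); (2) results of sessionInfo()? (3) results of find("cor")?
I suspect this is a problem specific to your setup rather than a general bug, but it is a well-asked question.
Weird. I hope someone can test this on 3.1.1 running on 64-bit Windows. A bug seems very, very, very unlikely, but I am running out of other explanations.

Tags: r nan correlation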

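【Answer 1】:

Giovanni, it works fine for me. Maybe you should try changing the cor argument to use = "complete.obs" and see whether that helps. You should also check whether your CSV file is corrupted.

I hope it helps.

【Comments】:

Posting your R version and platform would help for this kind of answer.
@Danish: Thanks for your suggestion. It still gives inconsistent results. On the other hand, if the csv were corrupted, it would produce consistently unexpected (or wrong) results. What really puzzles me is the inconsistency.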

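【Answer 2】:

Although I am still puzzled, I started playing with the various options for the use= argument of cor. I found that if I use cor(t$nitrate,t$sulfate,use="pairwise.complete.obs"), I get consistent results: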

> x<-NULL; for (i in 1:100){x<-c(x,cor(t$nitrate,t$sulfate,use="pairwise.complete.obs"))};x
  [1] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [12] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [23] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [34] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [45] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [56] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [67] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [78] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
 [89] 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441 0.2967441
[100] 0.2967441

I still don't understand why the other options for use, which other users passed, did not produce the odd behaviour for them.
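For two plain vectors, the NA-dropping options of use= are expected to keep exactly the same observations, so on a working system they should all agree. The sketch below illustrates this on synthetic data; the vectors `a` and `b` are made up for illustration and are not the nitrate/sulfate columns.

```r
# Two synthetic, correlated vectors with a few missing values
set.seed(42)
a <- rnorm(50)
b <- a + rnorm(50)
a[c(3, 10)]     <- NA
b[c(7, 10, 20)] <- NA

# The three options that drop incomplete observations
opts <- c("complete.obs", "na.or.complete", "pairwise.complete.obs")
res  <- sapply(opts, function(u) cor(a, b, use = u))

# Reference value: remove incomplete pairs by hand, then correlate
ok     <- complete.cases(a, b)
manual <- cor(a[ok], b[ok])
```

All three entries of `res` should equal `manual` up to floating-point tolerance. If one option behaves differently on a given machine, that points at the machine rather than at the option.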

【Comments】:

As @MartinMächler said, this is more of a workaround than a solution, but it may help someone, so I am marking it as the answer.

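【Answer 3】:

A few comments and notes:

Nobody else has been able to reproduce your problem.
There is no problem with the 120.csv file; it is fine.
Indeed, using another use=".." option is only a workaround.
The underlying C code in the R sources uses ISNAN(.) everywhere to detect whether a value is NA or NaN, and that in turn uses the isnan(.) function of your (system-internal) C library.
You (and only you) sometimes get NaN because ISNAN(.) does not return "true" in some cases where it should, so the floating-point arithmetic computes with the NA and correctly returns NaN.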
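The mechanism in the last two points can be illustrated in R itself. This is a rough analogy only: the real filtering happens in C via ISNAN(.), and `pearson` below is a hypothetical helper written for this sketch, not R's implementation.

```r
# NA is R's missing-value marker; NaN is the IEEE "not a number" value
is.na(NA_real_)    # TRUE
is.nan(NA_real_)   # FALSE
is.na(NaN)         # TRUE
is.nan(NaN)        # TRUE

u <- c(1, 2, NA, 4, 5)
v <- c(2, 1, 4, NA, 6)

# Hand-written Pearson correlation with no missing-value handling at all
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

ok <- complete.cases(u, v)
filtered   <- pearson(u[ok], v[ok])  # incomplete pairs removed first: a finite number
unfiltered <- pearson(u, v)          # a missing value leaks into the arithmetic
```

`filtered` matches what `cor(u, v, use = "complete.obs")` returns, while `unfiltered` is missing (`is.na()` is TRUE for it). A filter that occasionally fails to flag a missing value would produce the second kind of result at random, which is the pattern seen in the question.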

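As an "old" R core member, I can assure you that ISNAN(.) is used in many fundamental places in R's core computations, and the observation that, for you, it sometimes seems not to detect NA/NaN, so that they propagate into the result, is very troubling. As Duncan Murdoch said in reply to your R bug report https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16058 this must be a problem with your particular "system"... Assuming that you simply downloaded R from CRAN, also for R 3.1.2, and still see the problem, I tend to say that your system software (Windows) or, less likely, your hardware must be slightly broken or corrupted.

【Comments】:

Thanks for your input. I was going to link to the discussion in the bug report here, but you beat me to it! Duncan suggested that there may be a problem with the msvcrt.dll library or with the floating-point processor. I have read some similar complaints about octave builds, which makes me consider (once again) migrating to Linux or OS X. However, if this really is a library problem, then I am fairly sure some people are unaware of the same problem in their systems. I hope someone can sort this out. In the meantime, as you suggested, I will mark my own reply as a workaround.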