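How to perform clustering without removing rows where NA is present in R

【Question title】: How to perform clustering without removing rows where NA is present in R
【Posted】: 2013-12-24 15:18:25
【Question】:

I have a data set in which some elements are NA. What I want to do is cluster it without removing the rows where NA is present.

I understand that the gower distance measure in daisy allows for this. But why does my code below not work? I welcome alternatives other than daisy.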

# Plot heat map together with dendrogram.

library("gplots")
library("cluster")


# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7]  <- "NA"

 mydata <- mtcars

hclustfunc <- function(x) hclust(x, method="complete")

# Initially I wanted to use this but it didn't take NA
#distfunc <- function(x) dist(x,method="euclidean")

# Try using the daisy GOWER function,
# which is supposed to work with NA values
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

# Perform clustering heatmap
heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);

The error message I get is this:

    Error in which(is.na) : argument to 'which' is not logical
Calls: distfunc.g -> daisy
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In daisy(x, metric = "gower") :
  binary variable(s) 8, 9 treated as interval scaled
Execution halted

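In the end, I want to perform hierarchical clustering on data that is allowed to contain NA.

Update

Converting with as.numeric works for the example above. But why does this code fail when the data is read from a text file?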

library("gplots")
library("cluster")

# This time read from file
mtcars <- read.table("http://dpaste.com/1496666/plain/",na.strings="NA",sep="\t")

# Following suggestion convert to numeric
mydata <- apply( mtcars, 2, as.numeric )

hclustfunc <- function(x) hclust(x, method="complete")
#distfunc <- function(x) dist(x,method="euclidean")
# Try using daisy GOWER function 
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);

The error I get is this:

  Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Error in hclust(x, method = "complete") : 
  NA/NaN/Inf in foreign function call (arg 11)
Calls: hclustfunc -> hclust
Execution halted

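【Comments】:

"NA" is not the same as NA. But apart from that, how do you propose to define the distance between two points when NA is one of the values?
As far as I know, daisy takes care of that: stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html
I don't understand, how did you solve this? I get the same error message, and I can't find any site that explains what to do. I don't want to simply remove the NA values; I want them to appear in my heatmap as "missing" or something similar. If you figured it out, please post an answer. Thanks.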
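On the first comment's question of how a distance can be defined when one of the values is NA: as I read the daisy documentation, the Gower dissimilarity between two rows averages the range-scaled absolute differences over only those variables that are observed in both rows. The sketch below is my own check of that reading on mtcars with two real NA values; the hand computation (`rng`, `ok`, `manual`) is illustrative and not code from the cluster package.

```r
library(cluster)  # provides daisy()

mydata <- mtcars
mydata[2, 2] <- NA   # real missing values, no quotes
mydata[6, 7] <- NA

# Range of every column, ignoring missing values
rng <- sapply(mydata, function(col) diff(range(col, na.rm = TRUE)))

# Compare rows 1 and 2 by hand
i <- 1
j <- 2
xi <- unlist(mydata[i, ])
xj <- unlist(mydata[j, ])

# Use only the variables observed in both rows
ok <- !is.na(xi) & !is.na(xj)
manual <- sum(abs(xi[ok] - xj[ok]) / rng[ok]) / sum(ok)

# The same pair as computed by daisy
from_daisy <- as.matrix(daisy(mydata, metric = "gower"))[i, j]
```

Here the NA in row 2 simply drops one variable from the average for that pair, so no row has to be removed. daisy still warns that the two-valued columns are treated as interval scaled, as in the question.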

Tags: r cluster-analysis bioconductor

【Solution 1】:

The error is caused by non-numeric variables in the data (numbers encoded as strings). You can convert them to numeric:

mydata <- apply( mtcars, 2, as.numeric )
d <- distfunc(mydata)
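An alternative to converting afterwards is to avoid creating the strings in the first place. The following is a minimal sketch, assuming the same mtcars example as in the question: assigning NA without quotes keeps every column numeric, so daisy and hclust can be called directly.

```r
library(cluster)  # provides daisy()

mydata <- mtcars
# Assign real missing values: NA without quotes keeps the columns numeric
mydata[2, 2] <- NA
mydata[6, 7] <- NA

# Every column should still be numeric
stopifnot(all(sapply(mydata, is.numeric)))

# Gower dissimilarity, computed on the data frame that contains NA
d <- daisy(mydata, metric = "gower")

# Hierarchical clustering on the dissimilarity object
fit <- hclust(d, method = "complete")
```

With only a few missing cells, as here, the resulting dissimilarities contain no NA, so hclust receives a complete dissimilarity object.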

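【Comments】:

@neversaint, when you assign an NA value to a numeric data.frame, don't use quotes. That is what caused your problem. Quotes delimit character constants. If the intent is to convert to a numeric matrix, any character value present in the data will coerce the remaining values in the matrix to character.
In your update, the file is not tab-delimited: you end up with a single column, and since its contents (the entire line) cannot be converted to numeric, everything is replaced with NA.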
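The separator problem can be reproduced without the original file. The sketch below uses a small made-up sample (`txt`) and assumes the real file is whitespace-delimited, which the thread does not state; only the fact that it is not tab-delimited comes from the comment above.

```r
# Hypothetical three-line sample standing in for the real file
txt <- c("mpg cyl disp",
         "21.0 6 160",
         "22.8 NA 108")

# Wrong separator: each whole line ends up in a single column
wrong <- read.table(text = txt, sep = "\t", header = TRUE, na.strings = "NA")
ncol(wrong)

# Converting that column to numeric turns every entry into NA
all_na <- all(is.na(suppressWarnings(apply(wrong, 2, as.numeric))))

# Default separator (any whitespace): columns are split as intended
right <- read.table(text = txt, header = TRUE, na.strings = "NA")
ncol(right)
```

Checking `ncol()` or `str()` right after reading is a quick way to catch this before the data reaches daisy.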

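【Solution 2】:

Using as.numeric may help in this case, but I do think the original question points to a bug in the daisy function. Specifically, it contains the following code: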

    if (any(ina <- is.na(type3))) 
    stop(gettextf("invalid type %s for column numbers %s", 
        type2[ina], pColl(which(is.na))))

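The intended error message is not printed because which(is.na) is wrong. It should be which(ina).

I guess I should now find out where/how to report this bug.

【Comments】:

Indeed, thank you @rakensi, also for reporting the typo/thinko that led to a "not very helpful" error message instead of a useful one. As you know, I have fixed the code in the development version of the cluster package (svn.r-project.org/R-packages/trunk/cluster/R/daisy.q).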
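The difference between the two calls can be seen in isolation. In this small sketch, `type3` is a made-up stand-in for the variable of the same name inside daisy, not its real contents: `is.na` on its own is a function object, so `which()` rejects it, while `ina` is the logical vector that was meant.

```r
# Hypothetical stand-in for the internal vector checked by daisy
type3 <- c(1L, NA, 3L)

# Logical vector marking the invalid entries
ina <- is.na(type3)

# Buggy form: passes the function is.na itself, which is not logical
res <- tryCatch(which(is.na), error = function(e) conditionMessage(e))

# Intended form: positions of the invalid entries
bad <- which(ina)
```

`res` holds the same "argument to 'which' is not logical" message that appears in the question, which is why the more informative "invalid type" error was never shown.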