Measures of association in R -- Kendall's tau-b and tau-c: answers

【Question title】: Measures of association in R -- Kendall's tau-b and tau-c
【Posted】: 2011-02-03 05:05:09
【Question】:

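Are there any R packages for computing Kendall's tau-b and tau-c and their associated standard errors? My searches on Google and Rseek turned up nothing, but surely someone has implemented these in R.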

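【Comments on the question】:

A comparison against hand calculation shows that cor(x, y, method = "kendall") (in the preinstalled stats package) gives Kendall's tau-b, not Kendall's tau-a. (At least as of R version 3.0.2.)

Tags: r statistics distribution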

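【Answer 1】:

Have you tried the cor function? Its method argument can be set to "kendall" (and also to "pearson" or "spearman" if needed). I'm not sure this covers all the standard errors you're looking for, but it should help you get started.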

【Comments】:

-1: This doesn't mention Kendall's tau-b or tau-c, so it doesn't answer the question.

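【Answer 2】:

Just to expand on Stedy's answer... cor(x,y,method="kendall") will give you the correlation, and cor.test(x,y,method="kendall") will give you a p-value (unlike the Pearson case, it does not report a confidence interval for Kendall's tau).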
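As a quick illustration of the two calls, here is a minimal sketch on the built-in mtcars data. The choice of mpg and hp is only for illustration, and the last line simply checks whether the returned test object carries a conf.int element for method = "kendall".

```r
# Built-in example data; mpg and hp are an arbitrary illustrative pair.
x <- mtcars$mpg
y <- mtcars$hp

# Point estimate only.
tau <- cor(x, y, method = "kendall")

# Estimate plus test statistic and p-value.
# exact = FALSE skips the exact test, which is unavailable when there are ties.
res <- cor.test(x, y, method = "kendall", exact = FALSE)

print(tau)
print(res$estimate)           # same value as cor()
print(res$p.value)
print(is.null(res$conf.int))  # is a confidence interval missing?
```

The estimate from cor.test() should match cor(); anything beyond the p-value (for example an interval) has to come from elsewhere.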

Also, take a look at the Kendall package, which provides a function that claims a better approximation.

> library(Kendall)
> Kendall(x,y)

The Deducer package also has a cor.matrix function that prints the results nicely:

> library(Deducer)
> cor.matrix(variables=d(mpg,hp,wt),,
+ data=mtcars,
+ test=cor.test,
+ method='kendall',
+ alternative="two.sided",exact=F)

                          Kendall's rank correlation tau                          

           mpg     hp      wt     
mpg    cor 1       -0.7428 -0.7278
         N 32      32      32     
    stat**         -5.871  -5.798 
   p-value         0.0000  0.0000 
----------                        
 hp    cor -0.7428 1       0.6113 
         N 32      32      32     
    stat** -5.871          4.845  
   p-value 0.0000          0.0000 
----------                        
 wt    cor -0.7278 0.6113  1      
         N 32      32      32     
    stat** -5.798  4.845          
   p-value 0.0000  0.0000         
----------                        
    ** z
    HA: two.sided

【Comments】:

-1: This doesn't mention Kendall's tau-b or tau-c, so it doesn't answer the question.

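【Answer 3】:

The psych package has a routine for Kendall's coefficient, corr.test(x, method = "kendall"). This function can be applied to a data.frame and also displays p-values for each pair of variables. I guess it reports the tau-a coefficient. The only drawback is that it is really just a wrapper around the cor() function.

Wikipedia has a good reference on Kendall's coefficient; also check this link. Try the sos package and its findFn() function. Querying "tau a" and "tau b" gave me a bunch of hits, but none of them panned out. As @Ian suggested, the search results all seem to converge on the Kendall package.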

【Comments】:

@Firefeather's comments on this page indicate that cor computes tau-b.

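【Answer 4】:

There are three Kendall tau statistics (tau-a, tau-b, and tau-c).

They are not interchangeable, and none of the answers posted so far deals with the last two, which are the subject of the OP's question.

I was not able to find functions to compute tau-b or tau-c, either in the R standard library (stats et al.) or in any of the packages available on CRAN or other repositories. I used the excellent R package sos to search, so I believe the results returned were reasonably thorough.

So that's the short answer to the OP's question: there is no built-in or packaged function for tau-b or tau-c.

But it's easy to roll your own.

Writing R functions for the Kendall statistics is just a matter of translating these equations into code: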

Kendall_tau_a = (P - Q) / (n * (n - 1) / 2)

Kendall_tau_b = (P - Q) / ( (P + Q + Y0) * (P + Q + X0) ) ^ 0.5 

Kendall_tau_c = ((P - Q) * (2 * m)) / (n ^ 2 * (m - 1))

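tau-a: equals concordant minus discordant pairs, divided by a factor that accounts for the total number of pairs (determined by the sample size).

tau-b: explicitly accounts for ties, i.e. pairs whose two members have the same value on a variable; it equals concordant minus discordant pairs divided by the geometric mean of the number of pairs not tied on x and the number of pairs not tied on y (X0 and Y0 count the pairs tied only on x and only on y, respectively).

tau-c: a variant better suited to larger tables and also adjusted for non-square tables; it equals concordant minus discordant pairs multiplied by a factor that adjusts for table size.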

# Number of concordant pairs.
P = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c){sum(t[(r_ndx > r) & (c_ndx > c)])},
    r = r_ndx, c = c_ndx))
}

# Number of discordant pairs.
Q = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply( function(r, c){
      sum(t[(r_ndx > r) & (c_ndx < c)])
  },
    r = r_ndx, c = c_ndx) )
}

# Sample size (total number of observations).
n = sum(t)

# The lesser of number of rows or columns.
m = min(dim(t))

So these four quantities are all you need to compute tau-a, tau-b and tau-c:

P
Q
m
n

(plus X0 and Y0 for tau-b)

For example, the code for tau-c is:

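kendall_tau_c = function(t){
    t = as.matrix(t)
    m = min(dim(t))   # lesser of number of rows or columns
    n = sum(t)        # total number of observations
    # Return the value directly so it also prints at the console.
    (m * 2 * (P(t) - Q(t))) / ((n ^ 2) * (m - 1))
}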
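In the same spirit, tau-b can be sketched from a contingency table. This is a minimal sketch, not the answer author's code: pair_counts() and kendall_tau_b_table() are hypothetical helper names, and instead of computing X0 and Y0 directly it subtracts the pairs tied on each margin from the total number of pairs. The 2 x 5 table is the one used in the DescTools answer further down this page, so the printed values can be compared with the ones reported there.

```r
# Hypothetical helper: pair counts for a two-way table of frequencies.
pair_counts <- function(tab) {
  tab <- as.matrix(tab)
  rows <- seq_len(nrow(tab))
  cols <- seq_len(ncol(tab))
  P <- 0
  Q <- 0
  for (i in rows) {
    for (j in cols) {
      # Cells in later rows: to the right are concordant, to the left discordant.
      P <- P + tab[i, j] * sum(tab[rows > i, cols > j])
      Q <- Q + tab[i, j] * sum(tab[rows > i, cols < j])
    }
  }
  n <- sum(tab)
  total <- n * (n - 1) / 2
  tied_row <- sum(choose(rowSums(tab), 2))  # pairs sharing a row category
  tied_col <- sum(choose(colSums(tab), 2))  # pairs sharing a column category
  list(P = P, Q = Q, n = n, m = min(dim(tab)),
       untied_row = total - tied_row, untied_col = total - tied_col)
}

# Hypothetical helper: tau-b from the pair counts above.
kendall_tau_b_table <- function(tab) {
  k <- pair_counts(tab)
  (k$P - k$Q) / sqrt(k$untied_row * k$untied_col)
}

tab <- rbind(c(26, 26, 23, 18, 9), c(6, 7, 9, 14, 23))
k <- pair_counts(tab)
tau_a <- (k$P - k$Q) / (k$n * (k$n - 1) / 2)
tau_b <- kendall_tau_b_table(tab)
tau_c <- (k$P - k$Q) * 2 * k$m / (k$n^2 * (k$m - 1))

# Cross-check tau-b against cor() on the expanded data (one row per observation).
xs <- rep(as.vector(row(tab)), times = as.vector(tab))
ys <- rep(as.vector(col(tab)), times = as.vector(tab))
print(c(tau_a = tau_a, tau_b = tau_b, tau_c = tau_c,
        cor = cor(xs, ys, method = "kendall")))
```

The double loop is quadratic in the number of cells, which is fine for the small tables these statistics are normally used on.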

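So how do Kendall's tau statistics relate to the other statistical tests used in categorical data analysis?

All three Kendall tau statistics, along with Goodman and Kruskal's gamma, measure association in ordinal and binary data. (The Kendall tau statistics are more sophisticated alternatives to the gamma statistic, which uses only P and Q: gamma = (P - Q) / (P + Q).)

So Kendall's tau and gamma are the counterparts of the simple chi-square and Fisher's exact tests, both of which (as far as I know) are suitable only for nominal data.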
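For comparison, Goodman and Kruskal's gamma, taken here in its usual form (P - Q) / (P + Q), can be sketched from the same concordant and discordant counts. The vectors below are made-up ordinal scores used only for illustration; they do not come from this thread.

```r
# Made-up ordinal scores (illustrative assumption, not from the thread).
x <- c(1, 1, 2, 2, 3, 3, 3)
y <- c(1, 2, 2, 3, 2, 3, 3)

# Sign of every pairwise difference; keep each unordered pair once.
sx <- sign(outer(x, x, "-"))
sy <- sign(outer(y, y, "-"))
pair_sign <- (sx * sy)[upper.tri(sx)]

P <- sum(pair_sign > 0)   # concordant pairs
Q <- sum(pair_sign < 0)   # discordant pairs

gamma <- (P - Q) / (P + Q)              # tied pairs are simply dropped
tau_b <- cor(x, y, method = "kendall")  # ties enter the denominator

print(c(gamma = gamma, tau_b = tau_b))
```

Because gamma ignores tied pairs entirely, its magnitude is never smaller than that of tau-b on the same data.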

Example:

cpa_group = c(4, 2, 4, 3, 2, 2, 3, 2, 1, 5, 5, 1)
revenue_per_customer_group = c(3, 3, 1, 3, 4, 4, 4, 3, 5, 3, 2, 2)
weight = c(1, 3, 3, 2, 2, 4, 0, 4, 3, 0, 1, 1)

dfx = data.frame(CPA=cpa_group, LCV=revenue_per_customer_group, freq=weight)

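# Reshape data frame so 1 row for each event
# (preliminary step to create a weighted contingency table).
dfx2 = data.frame(lapply(dfx, function(x) { rep(x, dfx$freq)}))

# The columns of dfx are named LCV and CPA. This tabulates the 12 rows of dfx
# as-is (ignoring freq), which is what gives the value below; pass dfx2
# instead to apply the freq weights, which gives a different value.
t = xtabs(~ LCV + CPA, dfx)

kc = kendall_tau_c(t)

# Returns approximately -0.35.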

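【Comments】:

I've seen in a few places that Kendall's tau can be used for continuous data, e.g. this SPSS site: statistics.laerd.com/spss-tutorials/…
By the way, the formula for Stuart's C has a small error: it should be ((P - Q) * (2 * m)) / (n ^ 2 * (m - 1)), not (P - Q) * ((2 * m) / n ^ 2 * (m - 1)), right?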

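【Answer 5】:

I stumbled on this page today while looking for an implementation of Kendall's tau-b in R.
For anyone else looking for the same thing:
tau-b is in fact part of the stats package.

See this link for more details: https://stat.ethz.ch/pipermail/r-help//2012-August/333656.html

I tried it and it works:

library(stats)
x <- c(1, 1, 2)
y <- c(1, 2, 3)
cor.test(x, y, method = "kendall", alternative = "greater")

Here is the output: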

data:  x and y
z = 1.2247, p-value = 0.1103
alternative hypothesis: true tau is greater than 0
sample estimates:
      tau 
0.8164966 

Warning message:
In cor.test.default(x, y, method = "kendall", alternative = "greater") :
  Cannot compute exact p-value with ties

Ignore the warning message. The tau reported is in fact tau-b!

【Comments】:

Indeed, both cor.test(x, y, method = "kendall") and cor(x, y, method = "kendall") compute Kendall's tau-b.

【Answer 6】:

It's been quite a while, but all three of these statistics are implemented in DescTools.

library(DescTools)
# example in: 
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821
tab <- as.table(rbind(c(26,26,23,18,9),c(6,7,9,14,23)))

# tau-a
KendallTauA(tab, conf.level=0.95)
    tau_a    lwr.ci    ups.ci 
0.2068323 0.1771300 0.2365346 

# tau-b
KendallTauB(tab, conf.level=0.95)
    tau_b    lwr.ci    ups.ci 
0.3372567 0.2114009 0.4631126 

# tau-c
StuartTauC(tab, conf.level=0.95)
     tauc    lwr.ci    ups.ci 
0.4110953 0.2546754 0.5675151 

# alternative for tau-b:
d.frm <- Untable(tab, dimnames = list(1:2, 1:5))
cor(as.numeric(d.frm$Var1), as.numeric(d.frm$Var2),method="kendall")
[1] 0.3372567

# but no confidence intervals for tau-b! Check:
unclass(cor.test(as.numeric(d.frm$Var1), as.numeric(d.frm$Var2), method="kendall"))

【Comments】:

【Answer 7】:

According to this r-tutor page, http://www.r-tutor.com/gpu-computing/correlation/kendall-tau-b, tau-b is in fact what the base R function computes.

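【Comments】:

【Answer 8】:

Doug's answer here is incorrect. The Kendall package can be used to compute tau-b.

The Kendall package's function Kendall (and, it seems, cor(x,y,method="kendall") as well) uses the tau-b formula to handle ties. However, for vectors with ties, the Kendall package gives a more accurate p-value. See page 4 of the Kendall documentation, https://cran.r-project.org/web/packages/Kendall/Kendall.pdf, where D refers to the denominator of the Kendall calculation:

and D = n(n - 1)/2. S is called the score and D, the denominator, is the maximum possible value of S. When there are ties, the formula for D is more complicated (Kendall, 1974, Ch. 3), and this general formula for ties in both rankings is implemented in our function. In the case of no ties, the p-value of tau under the null hypothesis of no association is computed using an exact algorithm given by Best and Gipps (1974). When ties are present, a normal approximation with continuity correction is used, taking S as normally distributed with mean zero and variance var(S), where var(S) is given by Kendall (1976, eqn 4.4, p.55). Unless ties are very extensive and/or the data is very short, this approximation is adequate. If extensive ties are present, then the bootstrap provides an expedient solution (Davis and Hinkley, 1997). Alternatively, an exact method based on exhaustive enumeration is also available (Valz and Thompson, 1994), but it is not implemented in this package.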
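The bootstrap mentioned in the quoted passage can be sketched in base R. What follows is only an illustrative percentile bootstrap for the tau-b estimate itself, not the procedure from the Kendall package or from the cited reference; the data (mtcars), the number of resamples, and the seed are all arbitrary assumptions.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
x <- mtcars$mpg    # illustrative data, not from the thread
y <- mtcars$hp
n <- length(x)
n_boot <- 2000     # arbitrary number of resamples

boot_tau <- replicate(n_boot, {
  idx <- sample.int(n, replace = TRUE)  # resample (x, y) pairs together
  # A degenerate resample would give NA with a warning; silence the warning.
  suppressWarnings(cor(x[idx], y[idx], method = "kendall"))
})

tau_hat <- cor(x, y, method = "kendall")
ci <- quantile(boot_tau, c(0.025, 0.975), na.rm = TRUE)

print(tau_hat)
print(ci)
```

Resampling with replacement creates additional ties, which the tau-b denominator absorbs; whether a percentile interval is adequate for a given data set is a separate question.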

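I originally submitted this as an edit to Doug's answer, but it was rejected as "addressed to the author and better suited as an answer or comment". I would have posted it as a comment on the answer, but my reputation is not yet high enough to comment.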

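【Comments】:

【Answer 9】:

I've been doing some research on Kendall's tau. Using cor(x, y, method="kendall") directly gives Kendall's tau-b, which differs slightly from the original definition, i.e. Kendall's tau-a. Kendall's tau-b is more commonly used because it accounts for ties, so most available software (e.g. cor(), Kendall()) computes Kendall's tau-b.

The difference between Kendall's tau-a and tau-b is essentially the denominator. Specifically, for Kendall's tau-a the denominator is D = n*(n-1)/2, which is fixed, whereas for Kendall's tau-b the denominator is D = sqrt(number of pairs of Var1 excluding tied pairs) * sqrt(number of pairs of Var2 excluding tied pairs). Because ties shrink the denominator, tau-b is at least as large in magnitude as tau-a.

As a simple example, consider X=(1,2,3,4,4), Y=(2,3,4,4,4). Kendall's tau-b=0.88, while tau-a=0.7.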
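The example above can be checked by counting pairs directly. This is a minimal sketch; the variable names are illustrative, and the last line compares the hand-computed tau-b with cor().

```r
x <- c(1, 2, 3, 4, 4)
y <- c(2, 3, 4, 4, 4)
n <- length(x)

conc <- 0; disc <- 0     # concordant / discordant pairs
tie_x <- 0; tie_y <- 0   # pairs tied on x / tied on y

for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    dx <- sign(x[j] - x[i])
    dy <- sign(y[j] - y[i])
    if (dx == 0) tie_x <- tie_x + 1
    if (dy == 0) tie_y <- tie_y + 1
    if (dx * dy > 0) conc <- conc + 1
    if (dx * dy < 0) disc <- disc + 1
  }
}

n_pairs <- n * (n - 1) / 2
tau_a <- (conc - disc) / n_pairs
tau_b <- (conc - disc) / sqrt((n_pairs - tie_x) * (n_pairs - tie_y))

print(c(tau_a = tau_a, tau_b = tau_b))
print(cor(x, y, method = "kendall"))  # should agree with tau_b
```

Here 7 of the 10 pairs are concordant and none are discordant; one pair is tied on X and three on Y, which is what pulls the tau-b denominator below 10.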

As for Kendall's tau-c, I haven't seen much on it, so no comments.

【Comments】: