如何矢量化 Fisher 精确检验？答案

【问题标题】：How to vectorize Fisher's exact test?如何矢量化 Fisher 精确检验？
【发布时间】：2016-04-29 02:30:56
【问题描述】：

是否可以使用 Fisher 精确检验的矢量化优化此计算，如果可以，如何优化？当num_cases > ~1000000 时，运行时间很麻烦。

import numpy as np
from scipy.stats import fisher_exact

num_cases = 100
randCounts = np.random.random_integers(100,size=(num_cases,4))

def testFisher(randCounts):
    return [fisher_exact([[r[0],r[1]],[r[2], r[3]]])[0] for r in randCounts]

In [6]: %timeit testFisher(randCounts)
        1 loops, best of 3: 524 ms per loop

【问题讨论】：

谢谢，这正是我想问的。是否有必要对fisher_exact 进行矢量化？我也想将这个概念应用到其他统计方法中。
是的，有必要对fisher_exact进行矢量化（这意味着首先对阶乘进行矢量化）。目前，您的计算时间在 num_cases 中只是线性的。矢量化是一种改进的方法。可能 cython 或 numba 会有所帮助，但前提是 Fisher_exact 的 scipy 版本尚未被 cythonized（我认为是，但不知道事实）。
好像是纯python：github.com/scipy/scipy/blob/master/scipy/stats/stats.py
所以我推测了一下，但确切的费舍尔取决于阶乘，看起来问题可能是大阶乘可能超过 numpy 整数的范围，这使得它很难拥有一个快速的实现（所以这就是它在纯 python 中的原因？）——因为大整数将成为对象。我怀疑对于速度很重要的情况，人们将使用一些近似的测试（并且基于浮点数而不是整数）而不是精确的。不过不知道，这超出了我的知识范围，并且也开始成为统计问题，而不是编程问题。
我希望这不会变成统计问题。只是在寻找加快速度的想法/资源。

标签： python numpy scipy vectorization

【解决方案1】：

这是使用在fisher 中实现的精确费希尔的答案。我在 numpy 中手动计算 OR。

安装：

# pip install fisher
# or 
# conda install -c bioconda fisher

设置：

import numpy as np
np.random.seed(0)
num_cases = 100
c = np.random.randint(100,size=(num_cases,4), dtype=np.uint)

# head, i.e. 
c[:5]
# array([[44, 47, 64, 67],
#   [67,  9, 83, 21],
#   [36, 87, 70, 88],
#   [88, 12, 58, 65],
#   [39, 87, 46, 88]], dtype=uint64)

执行：

from fisher import pvalue_npy
_, _, twosided = pvalue_npy(c[:, 0], c[:, 1], c[:, 2], c[:, 3])
odds = (c[:, 0] * c[:, 3]) / (c[:, 1] * c[:, 2])

print("result fast p and odds", odds[0], twosided[0])
# result fast p and odds 0.9800531914893617 1.0
print("result slow", fisher_exact([[c[0][0], c[0][1]], [c[0][2], c[0][3]]]))
# result slow (0.9800531914893617, 1.0)

请注意，对于一百万行，它只需要两秒钟 :)

此外，要计算近似 OR，您可能希望在找到优势比之前向表中添加一个伪计数。这通常比 inf 更有趣，因为您可以比较近似值:)：

c2 = c + 1
odds = (c2[:, 0] * c2[:, 3]) / (c2[:, 1] * c2[:, 2])

编辑：

从 0.0.61>= 此方法包含在 pyranges 中，为 pr.stats.fisher_exact。

【讨论】：