Answers to: Fisher's and Pearson's test for independence

【Question title】: Fisher's and Pearson's test for independence
【Posted】: 2015-09-29 23:36:17
【Question description】:

In R, I have 2 datasets: group1 and group2.

For group1, I have 10 game_id values, each the ID of a game, and number, which is how many times that game was played in group1.

So if we type

group1

we get this output

game_id  number
1        758565
2        235289
...
10       87084

For group2, we get

game_id  number
1        79310
2        28564
...
10       9048

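If I want to test whether there is a statistical difference between group1 and group2 for the first 2 game_ids, I can use Pearson's chi-squared test.

In R I simply create the matrix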

# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)

So m gives us

a 758565  235289
b 79310  28564

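Here I can test H: "a is independent of b", meaning that the split of plays between game_id 1 and game_id 2 is the same in group1 and group2.

In R, we type chisq.test(m) and get a very low p-value, which means we can reject H, i.e. a and b are not independent.

How do I find which game_ids are played significantly more in group1 than in group2?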
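The test described above can be sketched as follows. This is a minimal illustration using the four counts shown for m; looking at the expected counts and the row shares next to the p-value is a suggestion added here, not something the question itself did.

```r
# Counts copied from the question: rows are groups, columns are game_id 1 and 2
m <- rbind(a = c(758565, 235289),
           b = c(79310, 28564))

# Yates' continuity correction is applied to 2 x 2 tables by default
res <- chisq.test(m)

res$p.value        # p-value of the test of independence
res$expected       # counts expected if the game split were the same in both groups
prop.table(m, 1)   # observed share of each game within each group (rows sum to 1)
```

Comparing the two rows of prop.table(m, 1) shows the direction of the difference, which the p-value alone does not.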

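【Question discussion】:

Your chi-squared test is not statistically valid
Why is it not valid?
Because it violates the assumptions of Pearson's chi-squared test. The events in your contingency table must be mutually exclusive and their probabilities must sum to 1. You are only considering part of the table, versus the 10 x 10 table you should be using (i.e. assuming players can only play games 1-10 and these are not just the counts of the top 10).
I don't know how you classify users as good/bad, but for each game you need to know a) how many good users played it and how many good users didn't, and b) how many bad users played it and how many bad users didn't. Then you can compare the percentages for each game.
So how should I solve this? If my contingency table should sum to one, I could create a new column showing the percentage of a given game_id within group1. For example, for game_id 1 we get 758565/sum(group1[,2]) = 9%. I do this for all game_ids and the percentages sum to 1.

Tags: r statistics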
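The per-group percentages and the "full table" idea from the comments above can be sketched like this. Only the three counts shown in the question are used (the other seven games are elided there), so the shares are computed over those three games and will not match the 9% quoted for the full data. Using the standardized residuals to spot the games that differ is an addition made here, not part of the thread.

```r
# Only the three counts shown in the question; the remaining games are elided there
group1 <- data.frame(game_id = c(1, 2, 10), number = c(758565, 235289, 87084))
group2 <- data.frame(game_id = c(1, 2, 10), number = c(79310, 28564, 9048))

# Share of each game within its own group (sums to 1 per group)
group1$share <- group1$number / sum(group1$number)
group2$share <- group2$number / sum(group2$number)

# 2 x k contingency table: one row per group, one column per game
tab <- rbind(group1 = group1$number, group2 = group2$number)
colnames(tab) <- group1$game_id

res <- chisq.test(tab)
res$p.value   # overall test: is the distribution over games the same in both groups?
res$stdres    # standardized residuals: large absolute values flag the games that differ
```

A positive residual in the group1 row means that game is played more often in group1 than the pooled distribution would predict.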

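【Solution 1】:

I created a simple version with only 3 games. I'm using a chi-squared test and a test for comparing proportions. Personally, I prefer the second one because it gives you an idea of the percentages you are comparing. Run the script and make sure you understand the process.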

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084))

dt_group1

#   game_id number_games
# 1       1       758565
# 2       2       235289
# 3       3        87084


# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games   # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games)  # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games  # just to get an idea about the percentages

dt_group1

#   game_id number_games number_rest_games number_all_games        Prc
# 1       1       758565            322373          1080938 0.70176550
# 2       2       235289            845649          1080938 0.21767113
# 3       3        87084            993854          1080938 0.08056336



# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048))

# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games




# input the game id you want to investigate
input_game_id = 1

# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
                c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))

# perform chi sq test
chisq.test(dt_test)

# Pearson's Chi-squared test with Yates' continuity correction
# 
# data:  dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16


# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])

# perform test of proportions
prop.test(x,y)

# 2-sample test for equality of proportions with continuity correction
# 
# data:  x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
#   0.02063233 0.02626776
# sample estimates:
#   prop 1    prop 2 
# 0.7017655 0.6783155

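Essentially, chisq.test is a test that compares counts/proportions, so you need to provide the numbers of "successes" and "failures" for the groups you compare (a contingency table as input). prop.test is another command for testing counts/proportions, where you need to provide the numbers of "successes" and the "totals".

Since you're happy with the result and you've seen how the process works, I'll add a more efficient way to perform these tests.

The first one uses the dplyr and broom packages: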
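As a quick check of the point above, the following sketch feeds the same game 1 counts to both functions, once as successes and totals and once as a successes/failures table. The variable names are made up for this illustration.

```r
# Counts for game 1, taken from the tables in this answer
played <- c(758565, 79310)      # "successes": times game 1 was played, per group
total  <- c(1080938, 116922)    # all games played, per group

by_prop  <- prop.test(played, total)                    # successes and totals
by_chisq <- chisq.test(cbind(played, total - played))   # successes and failures

# Both calls apply the continuity correction by default and agree on the result
all.equal(unname(by_prop$statistic), unname(by_chisq$statistic))
all.equal(by_prop$p.value, by_chisq$p.value)
```

The practical difference is in the output: prop.test also reports the two estimated proportions and a confidence interval for their difference.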

library(dplyr)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = rbind(dt_group1, dt_group2)


dt %>%
  group_by(group_id) %>%                                           # for each group id
  mutate(number_all_games = sum(number_games),                     # create new columns
         number_rest_games = number_all_games - number_games,
         Prc = number_games / number_all_games) %>%
  group_by(game_id) %>%                                            # for each game
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%      # perform the test
  ungroup()


#   game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
#     (int)      (dbl)      (dbl)     (dbl)        (dbl)     (dbl)        (dbl)        (dbl)
# 1       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

The other one uses the data.table and broom packages:

library(data.table)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))

# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]

dt[, `:=`(number_rest_games = number_all_games - number_games,
          Prc = number_games / number_all_games) , by=group_id]

# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]


#    game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
# 1:       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2:       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3:       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

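You can see that each row represents one game and the comparison is between group 1 and group 2. You can get the p-value from the corresponding column, but also other information about the test/comparison.

【Discussion】:

Thanks. This all makes sense. I tried it on all 10 of my game IDs, but I get a low p-value for every one of them. This means that for every game, we reject independence between group1 and group2. I find that a bit strange.
Don't forget that when you deal with statistical significance and p-values, even a small difference becomes significant as long as you have a large (enough) number of observations. That's why there is the concept of "effective sample size" when you want to perform these comparisons. Have a look at that concept and at how to design an experiment for comparing percentages.
I have a lot of data, so power or sample size shouldn't be a problem. The way I see it: for every game there is a dependence between the two groups. So the bad users have some "popular" games and the good users have their own popular games. No game is equally popular in both groups.
Exactly because you have a lot of data, you should expect even very small differences to be captured/classified as statistically significant. Since you're fine with the results, I'm also going to update my answer (adding a more efficient approach).
The key point is that you have to decide in advance what you consider a real effect, and try to collect as many observations as needed (the effective sample size) to capture that effect/difference as statistically significant. You should search for "design of experiments", "AB test design", etc. to see how this works. Without an experimental design, you can report your findings after the analysis and let the company decide what to do with small but statistically significant differences.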
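To address the direction asked about in the question (played more in group 1 than in group 2), one option is a one-sided prop.test per game. The sketch below uses base R only; the 0.05 threshold and the column names are assumptions made for illustration, not part of the original answer.

```r
g1 <- c(758565, 235289, 87084)   # plays per game in group 1
g2 <- c(79310, 28564, 9048)      # plays per game in group 2

res <- do.call(rbind, lapply(seq_along(g1), function(i) {
  # one-sided test: is the share of game i larger in group 1 than in group 2?
  tst <- prop.test(c(g1[i], g2[i]), c(sum(g1), sum(g2)), alternative = "greater")
  data.frame(game_id      = i,
             share_group1 = unname(tst$estimate[1]),
             share_group2 = unname(tst$estimate[2]),
             p_value      = tst$p.value)
}))

# assumed significance threshold of 0.05
res$more_in_group1 <- res$p_value < 0.05
res
```

The same information is available from the two-sided output above: a row where estimate1 exceeds estimate2 and the confidence interval lies entirely above zero points the same way.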
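The sample-size point made in these comments can be illustrated with a small sketch. The counts below are hypothetical, chosen to mimic the game 1 shares (about 70.2% versus 67.8%) at two very different sample sizes. The 5% level and 80% power passed to power.prop.test are conventional values assumed here, not taken from the thread.

```r
# Same shares, two hypothetical sample sizes per group
small <- prop.test(c(702, 678), c(1000, 1000))
large <- prop.test(c(702000, 678000), c(1000000, 1000000))

small$p.value   # with 1,000 observations per group
large$p.value   # with 1,000,000 observations per group

# Observations per group needed to detect 0.70 vs 0.68
# (sig.level and power are assumed, conventional values)
needed <- power.prop.test(p1 = 0.70, p2 = 0.68, sig.level = 0.05, power = 0.80)
ceiling(needed$n)
```

The observed difference is identical in both calls; only the amount of data changes, which is why the size of the effect should be judged separately from its p-value.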