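I'm not sure how sound this approach is, but I ran t-tests on the continuous variables and chi-square tests on the categorical ones.
The code is below.
batch1..batch4 are the available samples I want to compare, and final_data is the actual population here.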
t-test on the continuous variables, in a loop:
from scipy import stats

column = ["col1", "col2", "col3", "col4"]
for x in column:
    # One-sample t-test of each batch against the population mean of column x
    for name, batch in [("batch1", batch1), ("batch2", batch2),
                        ("batch3", batch3), ("batch4", batch4)]:
        print(name + " " + x + " " + str(stats.ttest_1samp(batch[x], final_data[x].mean())))
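As a rough, self-contained sketch of the same loop, the results can be collected into a table instead of printed line by line. The data here is synthetic, and the helper name `ttest_table` and the `batches` dictionary are hypothetical; only `stats.ttest_1samp` and the batch/population idea come from the code above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-ins for final_data and the batches (assumption, not the real data)
rng = np.random.default_rng(0)
final_data = pd.DataFrame({"col1": rng.normal(50, 10, 2000),
                           "col2": rng.normal(0, 1, 2000)})
batches = {
    "batch1": final_data.sample(200, random_state=1),
    "batch2": final_data.sample(200, random_state=2),
}

def ttest_table(batches, population, columns):
    """One-sample t-test of each batch column against the population mean."""
    rows = []
    for name, batch in batches.items():
        for col in columns:
            res = stats.ttest_1samp(batch[col], population[col].mean())
            rows.append({"batch": name, "column": col,
                         "t": res.statistic, "p": res.pvalue})
    return pd.DataFrame(rows)

result = ttest_table(batches, final_data, ["col1", "col2"])
print(result)
```

A table like this makes it easier to scan for batches whose p-value is small, which would suggest the batch mean differs from the population mean for that column.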
Chi-square on the categorical variables
I first wrote a function:
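from scipy.stats import chisquare

def pearsonChiSqGof(myData, field, exp=None):
    # Observed counts per category (value_counts sorts by frequency, descending)
    myFreq = myData[field].value_counts()
    df = len(myFreq) - 1  # degrees of freedom
    if exp is None:
        # No expected counts given: chisquare assumes equal frequencies
        minE = sum(myFreq) / len(myFreq)
        chiVal, pval = chisquare(myFreq)
    else:
        # exp must list the categories in the same order as myFreq
        minE = min(exp)
        chiVal, pval = chisquare(myFreq, exp)
    warning = None
    if minE < 5:
        warning = 'minimum expected count less than 5, chi-square test result not reliable'
    return chiVal, pval, df, minE, warning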
Then I ran a loop over all the columns:
fieldList = [...]  # placeholder: the list of categorical column names
for x in fieldList:
    for name, batch in [("batch1", batch1), ("batch2", batch2),
                        ("batch3", batch3), ("batch4", batch4)]:
        print(name + " " + x + " " + str(pearsonChiSqGof(batch, x)))
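One caveat: when no expected counts are passed, `chisquare` tests the batch against equal frequencies across categories, not against the population. A minimal sketch of deriving expected counts from final_data follows, assuming the goal is to compare each batch with the population proportions. The helper `expected_from_population` and the toy `colour` column are hypothetical, not part of the code above.

```python
import pandas as pd
from scipy.stats import chisquare

def expected_from_population(sample, population, field):
    """Return (observed, expected) counts aligned on the population's categories."""
    # Share of each category in the population, in a fixed (sorted) order
    pop_share = population[field].value_counts(normalize=True).sort_index()
    # Sample counts in the same order; categories missing from the sample count as 0.
    # Categories that appear only in the sample are dropped here (assumption).
    observed = sample[field].value_counts().reindex(pop_share.index, fill_value=0)
    # Scale the population shares to the sample size so both sums match
    expected = pop_share * observed.sum()
    return observed, expected

# Toy data standing in for final_data and batch1
final_data = pd.DataFrame({"colour": ["red"] * 50 + ["blue"] * 30 + ["green"] * 20})
batch1 = pd.DataFrame({"colour": ["red"] * 12 + ["blue"] * 5 + ["green"] * 3})

observed, expected = expected_from_population(batch1, final_data, "colour")
chi_val, p_val = chisquare(observed, expected)
print(chi_val, p_val)
```

Aligning both series on the same index avoids the ordering problem that arises because `value_counts` sorts by frequency, which can differ between a batch and the population.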