Confidence interval of normal distribution samples: answers

【Question title】: Confidence interval of normal distribution samples
【Posted】: 2018-09-21 00:18:12
【Question】:

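I want to find the confidence interval of samples that follow a normal distribution.

To test the code, I first created an example and tried to plot the confidence interval in a Jupyter notebook (Python kernel):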

%matplotlib notebook

import pandas as pd
import numpy as np
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt

s= np.random.normal(0,1,2000)
# s= range(10,14)                   <---this sample has the right CI
# s = (0,0,1,1,1,1,1,2)             <---this sample has the right CI

# confidence interval
# I think this is the function I misunderstand
ci=sms.DescrStatsW(s).tconfint_mean()

plt.figure()
_ = plt.hist(s,  bins=100)

# confidence interval, left line
one_x12, one_y12 = [ci[0], ci[0]], [0, 20]
# confidence interval, right line
two_x12, two_y12 = [ci[1], ci[1]], [0, 20]

plt.plot(one_x12, one_y12, two_x12, two_y12, marker = 'o')

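The green and yellow lines are supposed to mark the confidence interval, but they are not in the right place.

I may have misunderstood this function:

sms.DescrStatsW(s).tconfint_mean()

But the documentation says this function returns the confidence interval.

This is the figure I expect: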

%matplotlib notebook

import pandas as pd
import numpy as np
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt

s= np.random.normal(0,1,2000)


plt.figure()
_ = plt.hist(s,  bins=100)
# confidence interval, left line
one_x12, one_y12 = [np.std(s, axis=0) * -1.96, np.std(s, axis=0) * -1.96], [0, 20]
# confidence interval, right line
two_x12, two_y12 = [np.std(s, axis=0) * 1.96, np.std(s, axis=0) * 1.96], [0, 20]

plt.plot(one_x12, one_y12, two_x12, two_y12, marker = 'o')

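【Comments】:

tconfint_mean returns the confidence interval for the estimated mean parameter, not an interval for individual observations.
@user333700 Oh! That is what I misunderstood. Thanks for pointing it out.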
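
To make the comment above concrete, here is a minimal sketch that computes both intervals by hand with NumPy and SciPy. The variable names (ci_mean, obs_interval) are mine, and the t-based formula mean ± t * s / sqrt(n) is the textbook interval for the mean; I am assuming, not asserting, that it is what tconfint_mean computes.

```python
import numpy as np
from scipy import stats

# Same kind of sample as in the question (the seed is only for reproducibility).
rng = np.random.default_rng(100)
s = rng.normal(0, 1, 2000)

n = s.size
mean = s.mean()
sd = s.std(ddof=1)                     # sample standard deviation
sem = sd / np.sqrt(n)                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

# Interval for the MEAN: shrinks like 1/sqrt(n), so it is very narrow here.
ci_mean = (mean - t_crit * sem, mean + t_crit * sem)

# Interval that should contain about 95% of individual OBSERVATIONS.
obs_interval = (mean - 1.96 * sd, mean + 1.96 * sd)

print("interval for the mean:     ", ci_mean)
print("interval for observations: ", obs_interval)
```

With 2000 points the first interval is roughly 1/sqrt(2000) times as wide as the second, which is why the two lines in the question's plot sit almost on top of each other near zero.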

Tags: python matplotlib jupyter statsmodels confidence-interval

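【Solution 1】:

The question looks like "what function can compute the confidence interval".

Since the given data is normally distributed, this can be done simply with

ci = scipy.stats.norm.interval(0.95, loc=0, scale=1)

0.95 is the alpha argument, i.e. the probability mass the interval should contain. For a normal distribution, the central 95% lies within about 1.96 standard deviations of the mean. (https://en.wikipedia.org/wiki/1.96)

loc=0 specifies the mean, and scale=1 is sigma. (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)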
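
If the true mean and sigma are not known in advance, one option is to estimate them from the sample and pass the estimates as loc and scale. This is only a sketch under that assumption. The names mu_hat and sigma_hat are mine, and the ppf call is included just to show that interval is the pair of 2.5% and 97.5% quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(100)
s = rng.normal(0, 1, 2000)

# Estimate the parameters from the data instead of hard-coding 0 and 1.
mu_hat = s.mean()
sigma_hat = s.std(ddof=1)

# Central 95% interval of a normal distribution with the estimated parameters.
lo, hi = stats.norm.interval(0.95, loc=mu_hat, scale=sigma_hat)

# The same endpoints obtained directly from the quantile function.
lo2, hi2 = stats.norm.ppf([0.025, 0.975], loc=mu_hat, scale=sigma_hat)

print((lo, hi))
print((lo2, hi2))
```

The confidence level is passed positionally here, so the sketch does not depend on what the keyword for that argument is called in a given SciPy release.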

See @bogatron's answer to "Compute a confidence interval from sample data" for more details.

The following code produces the plot you want. I seeded the random number generator for reproducibility.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# Seed the global NumPy RNG; seed() returns None, so its result is not assigned
np.random.seed(100)
s = np.random.normal(0, 1, 2000)

plt.figure()
_ = plt.hist(s, bins=100)

sigma = 1
mean = 0
# Central 95% interval of a normal distribution with the given mean and sigma
ci = scipy.stats.norm.interval(0.95, loc=mean, scale=sigma)
print(ci)

# interval, left line
one_x12, one_y12 = [ci[0], ci[0]], [0, 20]
# interval, right line
two_x12, two_y12 = [ci[1], ci[1]], [0, 20]

plt.plot(one_x12, one_y12, two_x12, two_y12, marker='o')

ci is

(-1.959963984540054, 1.959963984540054)

Here is the plot.

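【Comments】:

I don't think this is a confidence interval in the mathematical sense. The interval does have the property that a standard normal random variable falls inside it with probability 95%. That may be what the question is asking for, but it should not be called a "confidence interval". I think it can be interpreted as a special case of a "prediction interval" without any epistemic uncertainty.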
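
The distinction in this comment can be checked with a small simulation. The sketch below is my own illustration, not part of the answer. It assumes a known N(0, 1) population, and the trial count and sample size are arbitrary choices. It measures two different 95% statements: how often a single draw lands in the fixed interval, and how often a t-based confidence interval for the mean covers the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 200, 2000          # arbitrary sizes chosen for the illustration
true_mean = 0.0

draws = rng.normal(true_mean, 1, size=(trials, n))

# Statement 1: a single draw falls inside the fixed interval from the answer.
lo, hi = stats.norm.interval(0.95, loc=0, scale=1)
frac_obs_inside = np.mean((draws > lo) & (draws < hi))

# Statement 2: a confidence interval for the mean, recomputed from each
# sample of size n, contains the true mean.
means = draws.mean(axis=1)
sems = draws.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
frac_ci_cover = np.mean(np.abs(means - true_mean) <= t_crit * sems)

print("share of draws inside the fixed interval:", frac_obs_inside)
print("share of mean CIs covering the true mean:", frac_ci_cover)
```

Both shares should come out close to 0.95, but they describe different random events: the first is about individual observations, the second is about an interval estimated from data.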