StatsModels 的置信区间和预测区间答案

【问题标题】：confidence and prediction intervals with StatsModelsStatsModels 的置信区间和预测区间
【发布时间】：2013-07-07 17:16:29
【问题描述】：

我这样做linear regression 和StatsModels：

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

n = 100

x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)

re = sm.OLS(y, X).fit()
print(re.summary())

prstd, iv_l, iv_u = wls_prediction_std(re)

我的问题是，iv_l 和 iv_u 是上下置信区间还是预测区间？

我如何得到别人？

我需要所有点的置信区间和预测区间来绘制图表。

【问题讨论】：

stackoverflow.com/a/63560689/10375049

标签： python statistics statsmodels

【解决方案1】：

对于测试数据，您可以尝试使用以下内容。

predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)

我发现summary_frame() 方法被埋在here 中，你可以找到get_prediction() 方法here。您可以通过修改“alpha”参数来更改置信区间和预测区间的显着性水平。

我在这里发帖是因为这是在寻找置信区间和预测区间的解决方案时出现的第一个帖子——尽管这与测试数据本身有关。

这是一个使用这种方法获取模型、新数据和任意分位数的函数：

def ols_quantile(m, X, q):
  # m: OLS model.
  # X: X matrix.
  # q: Quantile.
  #
  # Set alpha based on q.
  a = q * 2
  if q > 0.5:
    a = 2 * (1 - q)
  predictions = m.get_prediction(X)
  frame = predictions.summary_frame(alpha=a)
  if q > 0.5:
    return frame.obs_ci_upper
  return frame.obs_ci_lower

【讨论】：

predictions.summary_frame(alpha=0.05) 对我抛出错误 (TypeError: 'builtin_function_or_method' object is not iterable)。我在 github 上提出了一个问题：github.com/statsmodels/statsmodels/issues/4437
什么是out_of_sample_df？或者更一般地说，get_prediction() 采用什么参数？当我尝试喂它时，例如预测的 x 值，它ValueErrors out.
@Dan 见statsmodels.org/dev/generated/…。
@Dan 检查是否添加了常量值。

【解决方案2】：

更新请参阅更新的the second answer。一些模型和结果类现在有一个get_prediction 方法，它提供了额外的信息，包括预测平均值的预测区间和/或置信区间。

旧答案：

iv_l 和 iv_u 为您提供每个点的预测区间限制。

预测区间是观察的置信区间，包括对误差的估计。

我认为，statsmodels 中尚不提供平均预测的置信区间。（其实拟合值的置信区间隐藏在influence_outlier的summary_table中，但我需要验证这一点。）

统计模型的正确预测方法在 TODO 列表中。

加法

OLS 有置信区间，但访问有点笨拙。

运行脚本后包含：

from statsmodels.stats.outliers_influence import summary_table

st, data, ss2 = summary_table(re, alpha=0.05)

fittedvalues = data[:, 2]
predict_mean_se  = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T

# Check we got the right things
print np.max(np.abs(re.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))

plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()

这应该给出与 SAS 相同的结果，http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html

【讨论】：

这种方法的一个问题是，如果点稀疏，predict_mean_ci_low 和predict_mean_ci_upp 在绘制时将是锯齿状/尖状的，因为它们只存在于拟合值而不是范围的点。但是，拟合线是为所有点定义的。在github.com/statsmodels/statsmodels/blob/master/statsmodels/… 中有一条评论说using hat_matrix only works for fitted values - 知道如何解决这个问题吗？
我在将此答案应用于我的数据集时遇到问题，在此处作为单独的问题发布：stackoverflow.com/questions/34998772/…。非常感谢任何建议！
这是一个老问题，但基于这个答案，怎么可能只得到低于 95 CI 的数据点？我将此作为新问题发布stackoverflow.com/questions/50585837/…
当一个人使用“fit_regularized()”时，难道没有办法做同样的事情吗？似乎所有方法都适用于正常的“fit()”

【解决方案3】：

summary_frame 和 summary_table 在您需要单个分位数的精确结果但不能很好地矢量化时工作得很好。这将提供预测区间（不是置信区间）的正态近似值，并适用于分位数向量：

def ols_quantile(m, X, q):
  # m: Statsmodels OLS model.
  # X: X matrix of data to predict.
  # q: Quantile.
  #
  from scipy.stats import norm
  mean_pred = m.predict(X)
  se = np.sqrt(m.scale)
  return mean_pred + norm.ppf(q) * se

【讨论】：

【解决方案4】：

对于时间序列结果，您可以使用get_forecast() 方法获得更平滑的图。时间序列示例如下：

# Seasonal Arima Modeling, no exogenous variable
model = SARIMAX(train['MI'], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)

results = model.fit()

results.summary()

下一步是进行预测，这会生成置信区间。

# make the predictions for 11 steps ahead
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean

这些可以放在数据框中，但需要一些清理：

# get a better view
predictions_int.conf_int()

连接数据框，但清理标题

conf_df = pd.concat([test['MI'],predictions_int.predicted_mean, predictions_int.conf_int()], axis = 1)

conf_df.head()

然后我们重命名列。

conf_df = conf_df.rename(columns={0: 'Predictions', 'lower MI': 'Lower CI', 'upper MI': 'Upper CI'})
conf_df.head()

制作情节。

# make a plot of model fit
# color = 'skyblue'

fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)


x = conf_df.index.values


upper = conf_df['Upper CI']
lower = conf_df['Lower CI']

conf_df['MI'].plot(color = 'blue', label = 'Actual')
conf_df['Predictions'].plot(color = 'orange',label = 'Predicted' )
upper.plot(color = 'grey', label = 'Upper CI')
lower.plot(color = 'grey', label = 'Lower CI')

# plot the legend for the first plot
plt.legend(loc = 'lower left', fontsize = 12)


# fill between the conf intervals
plt.fill_between(x, lower, upper, color='grey', alpha='0.2')

plt.ylim(1000,3500)

plt.show()

【讨论】：

【解决方案5】：

您可以使用我的 repo (https://github.com/shahejokarian/regression-prediction-interval) 中的 Ipython notebook 中的 LRPI() 类来获取预测区间。

您需要设置 t 值以获得预测值所需的置信区间，否则默认为 95% conf。间隔。

LRPI 类使用 sklearn.linear_model 的 LinearRegression 、numpy 和 pandas 库。

笔记本中也有一个例子。

【讨论】：

【解决方案6】：

您可以根据 statsmodel 给出的结果和正态假设来计算它们。

以下是平均值的 OLS 和 CI 示例：

import statsmodels.api as sm
import numpy as np
from scipy import stats

#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.

#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values:
(t_minus, t_plus) = stats.t.interval(alpha = (1.0 - sl), df =  len(results.resid) - len(x0) )
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 +  t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))

您可以使用输入结果、点 x0 和显着性水平 sl 围绕它包装一个很好的函数。

我现在不确定你是否可以将它用于 WLS()，因为那里发生了额外的事情。

参考：[D.C. 中的 Ch3。蒙哥马利和 E.A.啄。 “线性回归分析简介。”第四。编辑，威利，1992]。

【讨论】：