R geepack：使用 GEE 进行不合理的大估计答案

【问题标题】：R geepack: unreasonably large estimates using GEER geepack：使用 GEE 进行不合理的大估计
【发布时间】：2017-05-31 17:27:10
【问题描述】：

我正在使用geepack for R 通过geeglm() 估计逻辑边际模型。但我得到了垃圾估计。它们大约大了 16 个数量级。但是 p 值似乎与我的预期相似。这意味着响应本质上变成了阶跃函数。见附图

这是生成绘图的代码：

require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)

这是回归表：

Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data, 
    id = defacto, corstr = "exchangeable")

 Coefficients:
             Estimate   Std.err  Wald Pr(>|W|)    
(Intercept) -7.38e+15  1.47e+15  25.1  5.4e-07 ***
power        2.05e+13  1.60e+12 164.4  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate  Std.err
(Intercept) 1.03e+15 1.65e+37

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate  Std.err
alpha    0.196 3.15e+21
Number of clusters:   3   Maximum cluster size: 381

希望得到一些帮助。谢谢！

亲切的问候，

马吕斯

【问题讨论】：

您将需要某种正则化或收缩组件。您可以使用广义线性混合模型 + 贝叶斯先验对固定效应（MCMCglmm 或 blme 包）执行此操作，但它将适合条件模型而不是边际模型......我不知道如何临时在 GEE 框架中实现收缩，或者是否有人已经这样做了。
我有一个边际逻辑方法，它为(Intercept) 提供-0.664，为power 提供0.003。我写出来有兴趣吗？
@swihart : 当然
出于好奇，数据应用是什么？我很感兴趣，因为我通常在有很多集群的情况下工作，每个集群只有几个观察值——而这里的一个集群有 3 个集群和一个集群上的 381 个观察值。
@swihart 有一个生物学应用程序。在一项实验中，数百个人在恰好 3 个环境中长大。我们想研究一个人在给定体重指数的情况下变得成熟的概率。但我们预计环境会引起相关性。

标签： r logistic-regression glm mixed-models random-effects

【解决方案1】：

我将给出三个过程，每个过程都是边缘化随机截距模型 (MRIM)。这些 MRIM 的系数具有边际逻辑解释，并且比 GEE 的量级更小：

| Model | (Intercept) |  power |  LogL  |
|-------|-------------|--------|--------| 
| `L_N` |       -1.050| 0.00267|  -270.1|
| `LLB` |       -0.668| 0.00343|  -273.8|
| `LPN` |       -1.178| 0.00569|  -266.4|

与不考虑任何相关性的 glm 相比，供参考：

| Model | (Intercept) |  power |  LogL  |
|-------|-------------|--------|--------| 
|  strt |       -0.207| 0.00216|  -317.1|

边缘化随机截距模型 (MRIM) 值得探索，因为您需要一个具有可交换相关结构的边缘模型用于聚类数据，这就是 MRIM 所展示的结构类型。

文献的代码（尤其是R script with comments）和PDF在GITHUB repo中。我在下面详细介绍了代码和文献。

MRIM 的概念自 1999 年以来一直存在，有关此的一些背景资料在 GITHUB repo 中。我建议先阅读 Swihart et al 2014，因为它回顾了其他论文。

按时间顺序--

L_NHeagerty (1999)：该方法适合具有正态分布随机截距的随机截距逻辑模型。诀窍是随机截距模型中的预测器是用边际系数非线性参数化的，因此得到的边际模型具有边际逻辑解释。它的代码是lnMLE R 包（不在 CRAN 上，而是在 Patrick Heagerty 的网站上here）。这种方法在代码中用L_N表示，表示logit（L）在边缘，没有条件尺度（_）的解释和正态（N）分布的随机截距。
LLB Wang & Louis (2003)：该方法适合具有桥分布随机截距的随机截距逻辑模型。与 Heagerty 1999 中的技巧是随机截距模型的非线性预测器不同，技巧是一种特殊的随机效应分布（桥分布），它允许随机截距模型和生成的边际模型都具有逻辑解释。它的代码是用gnlmix4MMM.R（在repo 中）实现的，它使用rmutil 和repeated R 包。这种方法在代码中表示为LLB，表示边缘上的logit（L），条件尺度上的logit（L）和分布截距的桥（B）。
LPN Caffo and Griswold (2006)：该方法适合具有正态分布随机截距的随机截距 probit 模型，而 Heagerty 1999 使用 logit 随机截距模型。这种替换使计算更容易，并且仍然产生边际 logit 模型。它的代码是用gnlmix4MMM.R （在repo 中）实现的，它使用rmutil 和repeated R 包。这种方法在代码中表示为LPN，以表示边际上的logit（L），条件尺度上的概率（P）和正态（N）分布的截距。
Griswold et al (2013)：另一篇评论/实用介绍。
Swihart et al 2014：这是 Heagerty 1999 和 Wang & Louis 2003 以及其他人的评论论文，概括了 MRIM 方法。最有趣的概括之一是允许边际模型和条件模型中的逻辑 CDF（等效地，logit 链接）改为近似逻辑 CDF 的稳定分布。它的代码是用gnlmix4MMM.R （在repo 中）实现的，它使用rmutil 和repeated R 包。我在R script with comments 中表示这个SSS 表示边缘稳定（S），条件尺度稳定（S）和稳定（S）分布截距。它包含在 R 脚本中，但在这篇关于 SO 的帖子中没有详细说明。

准备

#code from OP Question: edit `data` to `d` 
require(geepack)
d = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=d, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=d)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
#get some starting values from glm():
strt <- coef(glm(moden ~  power, family = binomial, data=d))
strt
#I'm so sorry but these methods use attach()
attach(d)

`L_N`Heagerty (1999)

# marginally specifies a logit link and has a nonlinear conditional model
# the following code will not run if lnMLE is not successfully installed.  
# See https://faculty.washington.edu/heagerty/Software/LDA/MLV/
library(lnMLE)
L_N <- logit.normal.mle(meanmodel = moden ~ power,
                        logSigma= ~1,
                        id=defacto,
                        model="marginal",
                        data=d,
                        beta=strt,
                        r=10) 
print.logit.normal.mle(L_N)

准备`LLB` 和`LPN`

library("gnlm")
library("repeated")
source("gnlmix4MMM.R") ## see ?gnlmix; in GITHUB repo 
y <- cbind(d$moden,(1-d$moden))

`LLB` 王和路易斯 (2003)

LLB  <- gnlmix4MMM(y = y,
                   distribution = "binomial",
                   mixture = "normal",
                   random = "rand",
                   nest = defacto,
                   mu = ~ 1/(1+exp(-(a0 + a1*power)*sqrt(1+3/pi/pi*exp(pmix)) - sqrt(1+3/pi/pi*exp(pmix))*log(sin(pi*pnorm(rand/sqrt(exp(pmix)))/sqrt(1+3/pi/pi*exp(pmix)))/sin(pi*(1-pnorm(rand/sqrt(exp(pmix))))/sqrt(1+3/pi/pi*exp(pmix)))))),
                   pmu = c(strt, log(1)),
                   pmix = log(1))

print("code: 1 -best 2-ok 3,4,5 - problem")
LLB$code
print("coefficients")
LLB$coeff
print("se")
LLB$se

`LPN` 卡福和格里斯沃尔德 (2006)

LPN  <- gnlmix4MMM(y = y,
                   distribution = "binomial",
                   mixture = "normal",
                   random = "rand",
                   nest = defacto,
                   mu = ~pnorm(qnorm(1/(1+exp(-a0 - a1*power)))*sqrt(1+exp(pmix)) + rand),
                   pmu = c(strt, log(1)),
                   pmix = log(1))

print("code: 1 -best 2-ok 3,4,5 - problem")
LPN$code
print("coefficients")
LPN$coeff
print("se")
LPN$se

3 种方法的系数：

rbind("L_N"=L_N$beta, "LLB" = LLB$coefficients[1:2], "LPN"=LPN$coefficients[1:2])

3 个模型的最大对数似然：

rbind("L_N"=L_N$logL, "LLB" = -LLB$maxlike, "LPN"=-LPN$maxlike)

【讨论】：

准备

L_NHeagerty (1999)

准备LLB 和LPN