GAM 模型预测与平均值之间的巨大差异答案

【问题标题】：Large difference between GAM model predictions and mean valuesGAM 模型预测与平均值之间的巨大差异
【发布时间】：2020-06-11 00:49:25
【问题描述】：

我有 3 个预测变量：年份 (2010-2019)、管理（禁忌/开放）和捕鱼方法（矛、线、拖钓或网）。我已将 GAM 拟合到三个响应变量：工作量（小时）、捕获量（重量，公斤）和效率（每单位工作量捕获量，CPUE）。对于所有三个模型，管理层都将年份拟合为 Gam，管理和方法也包括为固定效应（该组合的 AIC 最低，与我们的问题最相关）。所有三个响应变量都是倾斜的，需要进行 LogX+1 转换。转换后，权重和 CPUE 仍然偏斜，尽管残差直方图和残差与拟合值的关系图是可以接受的，过度分散测试也是如此。

数据中存在很多可变性，因此模型只能解释 5-10% 的偏差。

我使用 visreg 绘制了覆盖原始数据的模型。然而，鉴于它的传播，这很难解释。相反，然后我使用 geom_pointrange 在模型上绘制了平均值/SE。

对于小时数，模型非常适合每一年/管理类型的均值和误差。
对于权重数字，模型预测远低于每年的平均值/SE，并且 Management="Tabu open" 的斜率不正确。
对于 CPUE，模型形状大致正确，但远高于均值/SE 点范围。

鉴于它适用于 Hours ，我猜问题在于 Weight 和 CPUE 的数据偏斜，即使模型验证是可以接受的。

为什么模型与 Weight 和 CPUE 的均值/SE 数据之间几乎没有重叠，但 Hours 却没有？

是否有另一种更好的方法来绘制模型，而不是与 1. 原始数据或 2. 均值叠加？

代码如下：

summary(reeffish)

      Year            Management       Method        Hours            Weight             CPUE        
 Min.   :2010   Always open:398   Line    :417   Min.   : 0.500   Min.   :  0.000   Min.   : 0.0000  
 1st Qu.:2011   Tabu open  :312   Net     : 49   1st Qu.: 3.000   1st Qu.:  1.196   1st Qu.: 0.2735  
 Median :2013                     Spear   :217   Median : 4.420   Median :  2.200   Median : 0.5092  
 Mean   :2013                     Trolling: 27   Mean   : 5.689   Mean   :  4.108   Mean   : 0.8665  
 3rd Qu.:2014                                    3rd Qu.: 6.567   3rd Qu.:  4.234   3rd Qu.: 0.9472  
 Max.   :2019                                    Max.   :67.330   Max.   :152.000   Max.   :14.7800

#Fit model
Hours.gam1 = gam(log(Hours+1) ~ s(Year, k=5,by=Management) + Management+Method, data= reeffish)
Weight.gam1 = gam(log(Weight+1) ~ s(Year, k=5,by=Management) + Management+Method, data= reeffish)
CPUE.gam1 = gam(log(CPUE+1) ~ s(Year, k=5,by=Management) + Management+Method,data= reeffish)

#Model Validation
#Hours
E3 <- resid(Hours.gam1); F3 <- fitted(Hours.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

#Weight
E3 <- resid(Weight.gam1); F3 <- fitted(Weight.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

#CPUE
E3 <- resid(CPUE.gam1); F3 <- fitted(CPUE.gam1) # get the residuals and fitted values
par(mfrow=c(1,2),mar=c(5,4,3,3))  # this is setting up a multi-panel plot window
hist(E3, breaks=25,main="",xlab="Residuals")
plot(x=F3, y=E3, xlab="Fitted values", ylab="Residuals")
abline(0,0)

Residual plots of Hours, Weight and CPUE

#Model Output
summary(Hours.gam1)
summary(Weight.gam1)
summary(CPUE.gam1)

Family: gaussian 
Link function: identity 

Formula:
log(Hours + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.637127   0.030477  53.717  < 2e-16 ***
ManagementTabu open  0.123398   0.051120   2.414 0.016040 *  
MethodNet            0.284026   0.080653   3.522 0.000457 ***
MethodSpear          0.001778   0.047798   0.037 0.970341    
MethodTrolling      -0.224008   0.103047  -2.174 0.030054 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                edf Ref.df     F p-value  
s(Year):ManagementAlways open 1.000  1.000 2.660  0.1033  
s(Year):ManagementTabu open   1.908  2.262 3.463  0.0219 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.043   Deviance explained = 5.24%
GCV = 0.27033  Scale est. = 0.2673    n = 704

Family: gaussian 
Link function: identity 

Formula:
log(Weight + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.14805    0.03944  29.107  < 2e-16 ***
ManagementTabu open -0.06480    0.06473  -1.001    0.317    
MethodNet            0.57534    0.10475   5.493 5.56e-08 ***
MethodSpear          0.31332    0.06173   5.076 4.96e-07 ***
MethodTrolling      -0.09969    0.13294  -0.750    0.454    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                edf Ref.df     F p-value
s(Year):ManagementAlways open 1.742  2.021 1.809   0.161
s(Year):ManagementTabu open   1.770  2.118 0.758   0.406

R-sq.(adj) =  0.0714   Deviance explained = 8.14%
GCV = 0.44714  Scale est. = 0.44173   n = 704

Family: gaussian 
Link function: identity 

Formula:
log(CPUE + 1) ~ s(Year, k = 5, by = Management) + Management + 
    Method

Parametric coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.46682    0.02180  21.418  < 2e-16 ***
ManagementTabu open -0.09232    0.03157  -2.924  0.00357 ** 
MethodNet            0.25574    0.05789   4.418 1.16e-05 ***
MethodSpear          0.18187    0.03409   5.335 1.29e-07 ***
MethodTrolling       0.04806    0.07345   0.654  0.51306    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                               edf Ref.df     F p-value
s(Year):ManagementAlways open 1.82  2.092 1.813   0.154
s(Year):ManagementTabu open   1.00  1.000 0.922   0.337

R-sq.(adj) =  0.0725   Deviance explained = 8.15%
GCV = 0.13622  Scale est. = 0.1347    n = 704

#Plot model
Hours.visreg=visreg(Hours.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)
Weight.visreg=visreg(Weight.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)
CPUE.visreg=visreg(CPUE.gam1,"Year", by="Management" , partial=FALSE, overlay=TRUE, trans=exp)

Model plots

visreg(Hours.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)
visreg(Weight.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)
visreg(CPUE.gam1,"Year", by="Management" , partial=TRUE, overlay=TRUE, trans=exp)

Model plots with Raw Data

#Plot model with Means overlaid, calculated outside of R
Means.alldata<- read.csv('reeffishMeans.csv', header= T, strip.white= T)
Means.alldata=Means.alldata%>% mutate(Year=as.integer(Year))

#Hours figure
Hours.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=Hours, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= Hours.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=Hours.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= Hours-HoursSE, ymax= Hours+HoursSE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('Fishing Effort (Hours)'), limits=c(2,8), breaks=c(2,4,6,8)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

#Weight figure
Weight.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=Weight, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= Weight.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=Weight.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= Weight-WeightSE, ymax= Weight+WeightSE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('Catch (Kg)'), limits=c(0,8), breaks=c(2,4,6,8)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

#CPUE figure
CPUE.reeffish.gam= ggplot(data=Means.alldata, aes(x=Year, y=CPUE, fill=factor(Management), color=factor(Management)))+
  geom_ribbon(data= CPUE.visreg$fit, aes(x=Year, ymin=visregLwr, ymax=visregUpr), alpha=0.5, linetype=1, size=0.2, colour=NA) +
  geom_line(data=CPUE.visreg$fit, aes(x=Year,y=visregFit, group=Management, color=Management), lwd=1) +
  geom_pointrange(aes(ymin= CPUE-CPUESE, ymax= CPUE+CPUESE), width=.5, size=0.5, position=position_dodge(.05))+
  scale_colour_manual(values=c('deepskyblue4', 'coral2'))+
  scale_fill_manual(values=c('deepskyblue3', 'coral1'))+
  scale_y_continuous(name=expression('CPUE'), limits=c(0,2), breaks=c(0,.5,1,1.5,2)) +
  scale_x_continuous(name=expression('Year'), limits=c(2009,2020), breaks=c(2010,2012,2014,2016,2018,2020))+
  theme_classic()+
  theme(axis.title.y= element_text(size=14), axis.text.y= element_text(size=12), axis.text.x= element_text(size=12), axis.title.x= element_text(size=14), legend.text=element_text(size=12), legend.title=element_text(size=14))

Hours.reeffish.gam
Weight.reeffish.gam
CPUE.reeffish.gam

Model plots with Means of data

【问题讨论】：

标签： r ggplot2 model gam

【解决方案1】：

我已经发现了这个问题。因为数据经过 log(x+1) 变换，但仅通过 trans=exp 进行反向变换，模型预测全部减一。这仅显示为 CPUE 的显着差异，因为它们的单位要小得多。

【讨论】：