R中按组建立线性回归模型和预测答案

【问题标题】：Linear Regression model building and prediction by group in RR中按组建立线性回归模型和预测
【发布时间】：2020-12-24 03:21:35
【问题描述】：

我正在尝试基于子集（组）构建多个模型并生成它们的拟合。换句话说，考虑到我在下面的尝试，我正在尝试构建特定于国家/地区的模型。不幸的是，在我的尝试中，我只能考虑整个数据集来构建模型，而不是将其限制在数据集中的国家组中。你能帮我解决这个问题吗？

在第一种情况下，我正在做某种交叉验证来生成预测。在第二种情况下，我不是。我的两次尝试似乎都失败了。



library(modelr)
install.packages("gapminder")
library(gapminder)                           
data(gapminder) 

#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder)

models <- list(fit_model1 = model1,fit_model2 = model2)

gapminder %>% group_by(continent, country) %>%
  bind_cols(
    map(1:nrow(gapminder), function(i) {
      map_dfc(models, function(model) {
        training <- gapminder[-i, ] 
        fit <- lm(model, data = training)
        
        validation <- gapminder[i, ]
        predict(fit, newdata = validation)
        
      })
    }) %>%
      bind_rows()
  )


#CASE 2
model1 <- lm(lifeExp ~ pop, data = gapminder)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder)

models <- list(fit_model1 = model1,fit_model2 = model2)


for (m in names(models)) {
  gapminder[[m]] <- predict(models[[m]], gapminder %>% group_by(continent, country) )
  
}

【问题讨论】：

标签： r group-by regression

【解决方案1】：

按组建模的 tidyverse 解决方案是使用：

tidyr::nest() 对变量进行分组
dplyr::mutate() 与 purrr::map() 一起按组创建模型
broom::tidy() 或 broom::augment() 生成模型摘要和预测
tidyr::unnest() 和 dplyr::filter() 按组获取摘要和预测

这是一个例子。它与您问题中的代码不同，但我认为它仍然会有所帮助。

此代码按国家/地区生成线性模型lifeExp ~ pop 以及每个模型的拟合（预测）值。

library(tidyverse)
library(broom)
library(gapminder)

gapminder_lm <- gapminder %>% 
  nest(data = c(year, lifeExp, pop, gdpPercap)) %>% 
  mutate(model = map(data, ~lm(lifeExp ~ pop, .)), 
         fitted = map(model, augment)) %>% 
  unnest(fitted)

gapminder_lm

# A tibble: 1,704 x 12
   country     continent data              model  lifeExp      pop .fitted .resid .std.resid   .hat .sigma  .cooksd
   <fct>       <fct>     <list>            <list>   <dbl>    <int>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
 1 Afghanistan Asia      <tibble [12 x 4]> <lm>      28.8  8425333    33.2 -4.41     -1.54   0.182    2.92 0.262   
 2 Afghanistan Asia      <tibble [12 x 4]> <lm>      30.3  9240934    33.7 -3.35     -1.15   0.161    3.11 0.128   
 3 Afghanistan Asia      <tibble [12 x 4]> <lm>      32.0 10267083    34.3 -2.27     -0.773  0.139    3.24 0.0482  
 4 Afghanistan Asia      <tibble [12 x 4]> <lm>      34.0 11537966    35.0 -0.985    -0.331  0.116    3.32 0.00720 
 5 Afghanistan Asia      <tibble [12 x 4]> <lm>      36.1 13079460    35.9  0.193     0.0641 0.0969   3.34 0.000220
 6 Afghanistan Asia      <tibble [12 x 4]> <lm>      38.4 14880372    36.9  1.50      0.496  0.0849   3.30 0.0114  
 7 Afghanistan Asia      <tibble [12 x 4]> <lm>      39.9 12881816    35.8  4.07      1.35   0.0989   3.02 0.101   
 8 Afghanistan Asia      <tibble [12 x 4]> <lm>      40.8 13867957    36.4  4.47      1.48   0.0902   2.95 0.108   
 9 Afghanistan Asia      <tibble [12 x 4]> <lm>      41.7 16317921    37.8  3.91      1.29   0.0838   3.05 0.0759  
10 Afghanistan Asia      <tibble [12 x 4]> <lm>      41.8 22227415    41.2  0.588     0.202  0.157    3.33 0.00380 
# ... with 1,694 more rows

这样做的好处是将所有内容保存在一个整洁的数据框中，可以过滤出感兴趣的数据。

例如，过滤埃及并绘制真实值与预测值：

gapminder_lm %>% 
  filter(country == "Egypt") %>% 
  ggplot(aes(lifeExp, .fitted)) + 
  geom_point() + 
  labs(title = "Egypt")

【讨论】：

这很有帮助，但我相信我确实需要将它作为循环的一部分运行以进行一些引导（留下一个）。或者有没有办法用 tidyverse 包做到这一点？
如果使用nest_by，你能帮我修复下面的脚本吗：model1 % group_by(大陆, 国家) %>% bind_cols( map(1:nrow(gapminder), function(i) { map_dfc(models, function( model) { training % bind_rows()