为列表列数据框的每一行拟合不同的模型答案

【问题标题】：Fit a different model for each row of a list-columns data frame为列表列数据框的每一行拟合不同的模型
【发布时间】：2017-05-15 05:23:38
【问题描述】：

用 tidyverse 中的 list-columns 数据结构来拟合因数据框的行而异的不同模型公式的最佳方法是什么？

在 R for Data Science 中，Hadley 提供了一个极好的示例，说明如何使用列表列数据结构并轻松拟合许多模型 (http://r4ds.had.co.nz/many-models.html#gapminder)。我试图找到一种方法来拟合许多公式略有不同的模型。在下面改编自他的原始示例的示例中，为每个大陆拟合不同模型的最佳方法是什么？

library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

by_continent <- gapminder %>% 
  group_by(continent) %>% 
  nest()

by_continent <- by_continent %>% 
  mutate(model = map(data, ~lm(lifeExp ~ year, data = .)))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma statistic      p.value    df
#     <fctr>     <dbl>         <dbl>     <dbl>     <dbl>        <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419  304.1298 6.922751e-51     2
#2    Europe 0.4984659     0.4970649 3.8530964  355.8099 1.344184e-55     2
#3    Africa 0.2987543     0.2976269 7.6685811  264.9929 6.780085e-50     2
#4  Americas 0.4626467     0.4608435 6.8618439  256.5699 4.354220e-42     2
#5   Oceania 0.9540678     0.9519800 0.8317499  456.9671 3.299327e-16     2
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
##   deviance <dbl>, df.residual <int>

我知道我可以通过遍历 by_continent 来做到这一点（效率不高，因为它估计每个大陆的每个模型：

formulae <- list(
  Asia=~lm(lifeExp ~ year, data = .),
  Europe=~lm(lifeExp ~ year + pop, data = .),
  Africa=~lm(lifeExp ~ year + gdpPercap, data = .),
  Americas=~lm(lifeExp ~ year - 1, data = .),
  Oceania=~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)

for (i in 1:nrow(by_continent)) {
  by_continent$model[[i]] <- map(by_continent$data, formulae[[i]])[[i]]
}

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma  statistic       p.value    df
#     <fctr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2
#2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3
#3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3
#4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1
#5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
##   deviance <dbl>, df.residual <int>

但是是否可以在不跟随基础 R 中的循环的情况下执行此操作（并避免拟合我不需要的模型）？

我尝试的是这样的：

by_continent <- by_continent %>% 
left_join(tibble::enframe(formulae, name="continent", value="formula"))

by_continent %>% 
   mutate(model=map2(data, formula, est_model))

但我似乎无法想出一个有效的 est_model 函数。我试过这个不起作用的功能（h/t：https://gist.github.com/multidis/8138757）：

  est_model <- function(data, formula, ...) {
  mc <- match.call()
  m <- match(c("formula","data"), names(mc), 0L)
  mf <- mc[c(1L, m)]
  mf[[1L]] <- as.name("model.frame")
  mf <- eval(mf, parent.frame())
  data.st <- data.frame(mf)

  return(data.st)
}

（诚然，这是一个人为的例子。我的实际情况是，我的数据中缺少关键自变量的大量观察结果，所以我想在一个模型中拟合一个包含所有变量的完整观察值，而另一个模型只拟合一个子集其余观察的变量。）

更新

我想出了一个可行的 est_model 函数（尽管可能效率不高）：

est_model <- function(data, formula, ...) {
  map(list(data), formula, ...)[[1]]
}

by_continent <- by_continent %>% 
   mutate(model=map2(data, formula, est_model))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma  statistic       p.value    df
#      <chr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2
#2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3
#3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3
#4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1
#5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
##   df.residual <int>

【问题讨论】：

标签： r dplyr tidyr tidyverse

【解决方案1】：

我发现purrr::modify_depth() 可以在原始问题中使用est_model() 做我想做的事情。这是我现在满意的解决方案：

library(gapminder)
library(tidyverse)
library(purrr)
library(broom)

fmlas <- tibble::tribble(
  ~continent, ~formula,
  "Asia", ~lm(lifeExp ~ year, data = .),
  "Europe", ~lm(lifeExp ~ year + pop, data = .),
  "Africa", ~lm(lifeExp ~ year + gdpPercap, data = .),
  "Americas", ~lm(lifeExp ~ year - 1, data = .),
  "Oceania", ~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)

by_continent <- gapminder %>% 
  nest(-continent) %>%
  left_join(fmlas) %>%
  mutate(model=map2(data, formula, ~modify_depth(.x, 0, .y)))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

【讨论】：

【解决方案2】：

我发现列出模型公式更容易。每个模型只适合对应的continent。我在嵌套数据中添加了一个新列formula，以确保formula 和continent 的顺序相同，以防万一。

formulae <- c(
    Asia= lifeExp ~ year,
    Europe= lifeExp ~ year + pop,
    Africa= lifeExp ~ year + gdpPercap,
    Americas= lifeExp ~ year - 1,
    Oceania= lifeExp ~ year + pop + gdpPercap
)

df <- gapminder %>%
    group_by(continent) %>%
    nest() %>%
    mutate(formula = formulae[as.character(continent)]) %>%
    mutate(model = map2(formula, data, ~ lm(.x, .y))) %>%
    mutate(glance=map(model, glance)) %>%
    unnest(glance, .drop=T)

# # A tibble: 5 × 12
#   continent r.squared adj.r.squared     sigma  statistic       p.value    df      logLik        AIC        BIC
#      <fctr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>       <dbl>      <dbl>      <dbl>
# 1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2 -1427.65947 2861.31893 2873.26317
# 2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3  -995.41016 1998.82033 2014.36475
# 3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3 -2098.46089 4204.92179 4222.66639
# 4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1 -1083.35918 2170.71836 2178.12593
# 5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4   -22.06696   54.13392   60.02419
# # ... with 2 more variables: deviance <dbl>, df.residual <int>

【讨论】：

这可能是比我想出的更好的解决方案。也可以使用 modelr::formulas() 创建公式，尽管我更喜欢将建模函数 ("lm") 保留在公式中，因此它会显示在 summary() 调用中。谢谢！