Extracting variables from a formula when there are subscripts: answers

【Question title】: Extracting variables from a formula when there are subscripts
【Posted】: 2015-08-26 13:36:47
【Question description】:

There are several posts about obtaining a list of the variables in a regression formula in R; the basic answer is to use all.vars. For example,

> all.vars(log(resp) ~ treat + factor(dose))
[1] "resp"  "treat" "dose"

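This is nice because it strips away all the functions and operators (and duplicates, not shown). However, it is problematic when the formula contains $ operators or subscripts, as in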

> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> all.vars(form)
[1] "cows"   "weight" "bulls"  "herd"   "breed"

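Here the data frame names cows, bulls, and herd are identified as variables, and the names of the actual variables are decoupled or lost. Instead, what I really want is a result like this: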

> mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed"

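What is the most elegant way to do this? I have a proposal that I'll post as an answer, but maybe someone has a more elegant solution and will earn more votes!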

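【Question discussion】:

Well, formulas with $ and [[ are very problematic in use and should be avoided. What scenarios do you think make them necessary? What if I had ~x[[y]] and y <- "p"? What would this function return?
I agree that they should be avoided. But I am a package developer, and some users will fit models like the one I showed (though usually not that extreme).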
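
On the ~x[[y]] question in the comment above: a formula stores its terms unevaluated, so the value of y never enters the picture. The following is a small illustrative sketch using only base R; the variable names f and rhs are made up for the example.

```r
y <- "p"
f <- ~ x[[y]]

# In a one-sided formula, element 1 is `~` and element 2 is the right-hand side.
rhs <- f[[2]]

is.call(rhs)            # TRUE: x[[y]] is stored as a call to `[[`
as.character(rhs[[1]])  # "[["
all.vars(f)             # "x" "y"  (y is reported as a symbol, not looked up)
deparse(rhs)            # "x[[y]]" (the text that a mystery.fcn would return)
```

This is also why all.vars splits cows$weight into two names: $ and [[ are ordinary function calls in the parse tree, and all.vars collects every symbol found in their arguments.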

Tags: r string parsing

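【Solution 1】:

One approach that works, though it is a bit tedious, is to replace the operators $ etc. with characters that are legal in variable names, convert the strings back to a formula, apply all.vars, and then un-mangle the results: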

All.vars = function(expr, retain = c("\\$", "\\[\\[", "\\]\\]"), ...) {
    # replace operators with unlikely patterns _Av1_, _Av2_, ...
    repl = paste("_Av", seq_along(retain), "_", sep = "")
    for (i in seq_along(retain))
        expr = gsub(retain[i], repl[i], expr)
    # as.character() of a two-sided formula gives c("~", lhs, rhs), so
    # piece things back together in the right order, and call all.vars
    subs = switch(length(expr), 1, c(1,2), c(2,1,3))
    vars = all.vars(as.formula(paste(expr[subs], collapse = "")), ...)
    # reverse the mangling of names
    retain = gsub("\\\\", "", retain)  # un-escape the patterns
    for (i in seq_along(retain))
        vars = gsub(repl[i], retain[i], vars)
    vars
}

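Use the retain argument to specify the patterns we want to keep rather than treat as operators. The defaults are $, [[, and ]] (all suitably escaped). Here are some results: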

> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> All.vars(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed"

Changing retain to also include ( and ):

> All.vars(form, retain = c("\\$", "\\(", "\\)", "\\[\\[", "\\]\\]"))
[1] "log(cows$weight)"   "factor(bulls[[3]])" "herd$breed"

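The dots are passed on to all.vars, which is really the same function as all.names but with different defaults. So we can also obtain the functions and operators that are not in retain: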

> All.vars(form, functions = TRUE)
[1] "~"           "log"         "cows$weight" "*"          
[5] "factor"      "bulls[[3]]"  "herd$breed"
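
For comparison with the string-mangling approach above, here is a minimal sketch of a different technique: walking the formula's parse tree and treating any $ or [[ call as a single variable. The name atomic_vars and its keep argument are hypothetical (they are not part of this answer or of base R), and the sketch assumes no call has an empty argument, so something like x[, 1] is not handled.

```r
# Hypothetical helper: collect variable names from a formula or expression,
# keeping sub-expressions whose operator is listed in `keep` as one string.
atomic_vars <- function(expr, keep = c("$", "[[")) {
  if (is.name(expr)) {
    # A bare symbol is a variable.
    return(as.character(expr))
  }
  if (is.call(expr)) {
    head <- expr[[1]]
    if (is.name(head) && as.character(head) %in% keep) {
      # Deparse the whole sub-expression instead of descending into it.
      return(paste(deparse(expr), collapse = ""))
    }
    # Otherwise skip the function/operator itself and recurse into arguments.
    args <- as.list(expr)[-1]
    found <- unlist(lapply(args, atomic_vars, keep = keep))
    return(as.character(unique(found)))
  }
  # Constants such as 3 or "a" contribute nothing.
  character(0)
}

form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
atomic_vars(form)
```

On the form used throughout this thread, atomic_vars(form) should give the same three strings as All.vars(form). Because it never converts the formula to text and back, it does not depend on the _Av1_ style placeholders being absent from the user's variable names.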

【Solution 2】:

This is not sufficient for general use cases, but just for fun I thought I'd give it a try:

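mystery.fcn = function(string) {
  # `string` may be a formula; gsub() coerces it with as.character()
  # treat the interaction operator ":" as a separator
  string = gsub(":", " ", string)
  # strip function-call prefixes, parentheses, and the operators * ~ + -, then split on spaces
  string = unlist(strsplit(gsub("\\b.*\\b\\(|\\(|\\)|[*~+-]", "", string), split=" "))
  # drop the empty pieces left behind
  string = string[nchar(string) > 0]
  return(string)
}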

form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed" 

form1 = ~x[[y]]
mystery.fcn(form1)
[1] "x[[y]]"

form2 = z$three ~ z$one + z$two - z$x_y
mystery.fcn(form2)
[1] "z$three" "z$one"   "z$two"   "z$x_y"  

form3 = z$three ~ z$one:z$two
mystery.fcn(form3)
[1] "z$three" "z$one"   "z$two"

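【Discussion】:

Nice, but it also needs to handle the interaction operator :. I tried adding : to the bracketed expression, but it doesn't work correctly.
I added a new line to the function to handle that case. In fact, it may make sense to handle some other cases this way as well (e.g., x$y*x$z).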