【Question title】: Multiplying multiple columns with each other into a new dataframe in R
【Posted】: 2021-05-06 13:34:19
【Question】:

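I want to multiply many of my binary variables with each other to create new columns, so-called interaction variables. My dataset is structured as follows: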

YearCountry <- data.frame( Time = c("2000","2001", "2002", "2003",
                           "2000","2001", "2002", "2003",
                           "2000","2001", "2002", "2003"),
                  AL = c(1,1,1,1,0,0,0,0,0,0,0,0),
                  FR = c(0,0,0,0,1,1,1,1,0,0,0,0),
                  UK = c(0,0,0,0,0,0,0,0,1,1,1,1),
                  Y2000d = c(1,0,0,0,1,0,0,0,1,0,0,0),
                  Y2001d = c(0,1,0,0,0,1,0,0,0,1,0,0),
                  Y2002d = c(0,0,1,0,0,0,1,0,0,0,1,0),
                  Y2003d = c(0,0,0,1,0,0,0,1,0,0,0,1))
YearCountry

   Time AL FR UK Y2000d Y2001d Y2002d Y2003d
1  2000  1  0  0      1      0      0      0
2  2001  1  0  0      0      1      0      0
3  2002  1  0  0      0      0      1      0
4  2003  1  0  0      0      0      0      1
5  2000  0  1  0      1      0      0      0
6  2001  0  1  0      0      1      0      0
7  2002  0  1  0      0      0      1      0
8  2003  0  1  0      0      0      0      1
9  2000  0  0  1      1      0      0      0
10 2001  0  0  1      0      1      0      0
11 2002  0  0  1      0      0      1      0
12 2003  0  0  1      0      0      0      1

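I need to multiply the binary variable of each country (AL, FR, UK) with the binary variable of each year, so that I end up with #countries x #years new variables. In this example I have 3 countries and 4 years, which gives 12 new variables. My full data contains 105 countries and spans more than 20 years, so I need a general approach. I want data that looks like this: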

Interact <- data.frame(Time = c("2000","2001", "2002", "2003",
                                "2000","2001", "2002", "2003",
                                "2000","2001", "2002", "2003"),
                       Y2000xAL = c(1,0,0,0,0,0,0,0,0,0,0,0),
                       Y2001xAL = c(0,1,0,0,0,0,0,0,0,0,0,0),
                       Y2002xAL = c(0,0,1,0,0,0,0,0,0,0,0,0),
                       Y2003xAL = c(0,0,0,1,0,0,0,0,0,0,0,0),
                       Y2000xFR = c(0,0,0,0,1,0,0,0,0,0,0,0),
                       Y2001xFR = c(0,0,0,0,0,1,0,0,0,0,0,0),
                       Y2002xFR = c(0,0,0,0,0,0,1,0,0,0,0,0),
                       Y2003xFR = c(0,0,0,0,0,0,0,1,0,0,0,0),
                       Y2000xUK = c(0,0,0,0,0,0,0,0,1,0,0,0),
                       Y2001xUK = c(0,0,0,0,0,0,0,0,0,1,0,0),
                       Y2002xUK = c(0,0,0,0,0,0,0,0,0,0,1,0),
                       Y2003xUK = c(0,0,0,0,0,0,0,0,0,0,0,1))
Interact

   Time Y2000xAL Y2001xAL Y2002xAL Y2003xAL Y2000xFR Y2001xFR Y2002xFR Y2003xFR Y2000xUK Y2001xUK Y2002xUK Y2003xUK
1  2000        1        0        0        0        0        0        0        0        0        0        0        0
2  2001        0        1        0        0        0        0        0        0        0        0        0        0
3  2002        0        0        1        0        0        0        0        0        0        0        0        0
4  2003        0        0        0        1        0        0        0        0        0        0        0        0
5  2000        0        0        0        0        1        0        0        0        0        0        0        0
6  2001        0        0        0        0        0        1        0        0        0        0        0        0
7  2002        0        0        0        0        0        0        1        0        0        0        0        0
8  2003        0        0        0        0        0        0        0        1        0        0        0        0
9  2000        0        0        0        0        0        0        0        0        1        0        0        0
10 2001        0        0        0        0        0        0        0        0        0        1        0        0
11 2002        0        0        0        0        0        0        0        0        0        0        1        0
12 2003        0        0        0        0        0        0        0        0        0        0        0        1

【Comments】:

    Tags: r dataframe


    【Solution 1】:

    Here is one approach with dplyr::across. We can use purrr::invoke to turn the final result into a plain data.frame, as shown in a linked answer.

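    library(dplyr)
    library(purrr)
    library(stringr)  # provides str_replace()

    YearCountry %>% 
        # multiply each country dummy by the whole block of year dummies
        mutate(across(AL:UK, ~ . * select(cur_data(), Y2000d:Y2003d))) %>%
        select(-(Y2000d:Y2003d)) %>% 
        # flatten the nested data-frame columns into ordinary columns
        invoke(.f = data.frame) %>%
        # drop the "." that data.frame() puts between the two name parts
        rename_with(~str_replace(.,"\\.",""))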
       Time ALY2000d ALY2001d ALY2002d ALY2003d FRY2000d FRY2001d FRY2002d FRY2003d UKY2000d UKY2001d UKY2002d UKY2003d
    1  2000         1         0         0         0         0         0         0         0         0         0         0         0
    2  2001         0         1         0         0         0         0         0         0         0         0         0         0
    3  2002         0         0         1         0         0         0         0         0         0         0         0         0
    4  2003         0         0         0         1         0         0         0         0         0         0         0         0
    5  2000         0         0         0         0         1         0         0         0         0         0         0         0
    6  2001         0         0         0         0         0         1         0         0         0         0         0         0
    7  2002         0         0         0         0         0         0         1         0         0         0         0         0
    8  2003         0         0         0         0         0         0         0         1         0         0         0         0
    9  2000         0         0         0         0         0         0         0         0         1         0         0         0
    10 2001         0         0         0         0         0         0         0         0         0         1         0         0
    11 2002         0         0         0         0         0         0         0         0         0         0         1         0
    12 2003         0         0         0         0         0         0         0         0         0         0         0         1
    

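    【Comments】:

    • This works, but I think there is a problem with the column names: when I try to write the result to a DTA file (Stata) with library(haven), I get the error "Error: Columns of type list not supported yet".
    • Using purrr::invoke should solve that problem.
    • purrr is installed, but writing with haven now gives a new error: "Error: Failed to create column AL.Y2000d: A provided name contained an illegal character." Is there a way to remove all the "." from the column names?
    • Sure, you can use rename_with. Check the edit.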
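
The haven error quoted in the comments is triggered by the "." in names such as AL.Y2000d. As a minimal base R sketch (the small data frame `res` and the output file name are placeholders, not part of the original answer), every dot can be stripped from the column names with gsub and fixed = TRUE. Unlike str_replace, which only replaces the first match, this also covers names that contain more than one dot:

```r
# Hypothetical stand-in for the result above, with dotted column names
res <- data.frame(Time = c("2000", "2001"),
                  AL.Y2000d = c(1, 0),
                  AL.Y2001d = c(0, 1))

# fixed = TRUE treats "." as a literal dot, not as a regex wildcard
names(res) <- gsub(".", "", names(res), fixed = TRUE)
names(res)

# Writing to Stata would then look like this (file name is a placeholder):
# haven::write_dta(res, "interactions.dta")
```
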
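    【Solution 2】:

    1) model.matrix We split the names by their number of characters (country names have 2 characters, year names have 6) and paste the names in each group together with plus signs. (Alternatively, use Plus(grep("^..$", nms, value = TRUE)) to get the country names and use that in place of spl["2"], and similarly use Plus(grep("^Y....d$", nms, value = TRUE)) in place of spl["6"].)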

    c(`2` = "AL+FR+UK", `6` = "Y2000d+Y2001d+Y2002d+Y2003d")
    

    From this we build the formula:

    ~(AL + FR + UK):(Y2000d + Y2001d + Y2002d + Y2003d) + 0
    

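    and then compute its model matrix.

    By modifying the sprintf format, the formula can also be extended into one that lm accepts, so we may not even need to create the model matrix. For example, if we have a response vector R, we could write: s <- sprintf("R ~ (%s)*(%s)", spl["2"], spl["6"]); fo <- formula(s); lm(fo, YearCountry) to include all the variables, the country-by-year interactions and an intercept.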
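
To illustrate the difference between the two formula shapes, here is a small sketch that only inspects the terms and does not fit anything. The response name R is hypothetical, as in the paragraph above, and the country and year names are typed out by hand rather than taken from spl:

```r
Plus <- function(x) paste(x, collapse = "+")
countries <- c("AL", "FR", "UK")
years <- c("Y2000d", "Y2001d", "Y2002d", "Y2003d")

# ":" with "+0": interaction terms only, no intercept
fo_int <- formula(sprintf("~ (%s):(%s)+0", Plus(countries), Plus(years)))
labs_int <- attr(terms(fo_int), "term.labels")

# "*": main effects plus interactions, intercept kept ("R" is a hypothetical response)
fo_full <- formula(sprintf("R ~ (%s)*(%s)", Plus(countries), Plus(years)))
labs_full <- attr(terms(fo_full), "term.labels")

length(labs_int)   # 12 interaction terms
length(labs_full)  # 3 + 4 main effects + 12 interactions = 19
```

With a complete set of country and year dummies plus an intercept, the `*` form is rank deficient, so lm would be expected to report NA for some coefficients.
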

    Plus <- function(x) paste(x, collapse = "+")
    nms <- names(YearCountry)[-1]
    spl <- sapply(split(nms, nchar(nms)), Plus)
    
    s <- sprintf("~ (%s):(%s)+0", spl["2"], spl["6"])
    fo <- formula(s)
    
    model.matrix(fo, YearCountry)
    

    which gives this matrix:

       AL:Y2000d AL:Y2001d AL:Y2002d AL:Y2003d FR:Y2000d FR:Y2001d FR:Y2002d FR:Y2003d UK:Y2000d UK:Y2001d UK:Y2002d UK:Y2003d
    1          1         0         0         0         0         0         0         0         0         0         0         0
    2          0         1         0         0         0         0         0         0         0         0         0         0
    3          0         0         1         0         0         0         0         0         0         0         0         0
    4          0         0         0         1         0         0         0         0         0         0         0         0
    5          0         0         0         0         1         0         0         0         0         0         0         0
    6          0         0         0         0         0         1         0         0         0         0         0         0
    7          0         0         0         0         0         0         1         0         0         0         0         0
    8          0         0         0         0         0         0         0         1         0         0         0         0
    9          0         0         0         0         0         0         0         0         1         0         0         0
    10         0         0         0         0         0         0         0         0         0         1         0         0
    11         0         0         0         0         0         0         0         0         0         0         1         0
    12         0         0         0         0         0         0         0         0         0         0         0         1
    attr(,"assign")
     [1]  1  2  3  4  5  6  7  8  9 10 11 12
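
The question asks for a data frame that keeps Time and uses names like Y2000xAL, while model.matrix returns a matrix with names like AL:Y2000d. One possible way to bridge that gap is sketched below. The renaming pattern is an assumption based on the naming in the question, and the example data is rebuilt compactly so that the block runs on its own:

```r
# Rebuild the example data compactly (same values as in the question)
YearCountry <- data.frame(
  Time   = rep(c("2000", "2001", "2002", "2003"), times = 3),
  AL     = rep(c(1, 0, 0), each = 4),
  FR     = rep(c(0, 1, 0), each = 4),
  UK     = rep(c(0, 0, 1), each = 4),
  Y2000d = rep(c(1, 0, 0, 0), times = 3),
  Y2001d = rep(c(0, 1, 0, 0), times = 3),
  Y2002d = rep(c(0, 0, 1, 0), times = 3),
  Y2003d = rep(c(0, 0, 0, 1), times = 3)
)

fo <- ~ (AL + FR + UK):(Y2000d + Y2001d + Y2002d + Y2003d) + 0
mm <- model.matrix(fo, YearCountry)

# Turn "AL:Y2000d" into "Y2000xAL", the naming used in the question
colnames(mm) <- sub("^(.*):(Y[0-9]{4})d$", "\\2x\\1", colnames(mm))

# Put Time back in front; check.names = FALSE leaves the new names untouched
Interact <- data.frame(Time = YearCountry$Time, mm, check.names = FALSE)
names(Interact)
```

The resulting names contain neither "." nor ":", which should also sidestep the illegal-character error that haven reported for dotted names.
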
    

    Or we could write it compactly like this:

    Plus <- function(x) paste(x, collapse = "+")
    nms <- names(YearCountry)
    s <- sprintf("~ (%s):(%s)+0", Plus(nms[2:4]), Plus(nms[5:8]))
    fo <- formula(s)
    model.matrix(fo, YearCountry)
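
Both versions above depend on either the name lengths or the column positions. For a wider data set, such as the 105 countries mentioned in the question, selecting the year columns by pattern may be more robust. The sketch below uses a hypothetical toy data frame in which one country code has three letters, a case where splitting on nchar would no longer put all countries in one group; it assumes that every column other than Time and the year dummies is a country dummy:

```r
Plus <- function(x) paste(x, collapse = "+")

# Hypothetical toy data: country codes of different lengths
df <- data.frame(
  Time   = c("2000", "2001", "2000", "2001"),
  AL     = c(1, 1, 0, 0),
  FRA    = c(0, 0, 1, 1),
  Y2000d = c(1, 0, 1, 0),
  Y2001d = c(0, 1, 0, 1)
)

nms <- names(df)
# Year dummies follow the pattern Y<4 digits>d
years <- grep("^Y[0-9]{4}d$", nms, value = TRUE)
# Assumption: everything else except Time is a country dummy
countries <- setdiff(nms, c("Time", years))

fo <- formula(sprintf("~ (%s):(%s)+0", Plus(countries), Plus(years)))
mm <- model.matrix(fo, df)
colnames(mm)
```
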
    

    2) eList Another approach is to use a list comprehension. With the eList package we can do this:

    library(eList)
    DF(for(i in YearCountry[2:4]) for(j in YearCountry[5:8]) i*j)
    

    which gives this data frame. Use as.matrix(...) if you want a matrix instead.

       AL.Y2000d AL.Y2001d AL.Y2002d AL.Y2003d FR.Y2000d FR.Y2001d FR.Y2002d FR.Y2003d UK.Y2000d UK.Y2001d UK.Y2002d UK.Y2003d
    1          1         0         0         0         0         0         0         0         0         0         0         0
    2          0         1         0         0         0         0         0         0         0         0         0         0
    3          0         0         1         0         0         0         0         0         0         0         0         0
    4          0         0         0         1         0         0         0         0         0         0         0         0
    5          0         0         0         0         1         0         0         0         0         0         0         0
    6          0         0         0         0         0         1         0         0         0         0         0         0
    7          0         0         0         0         0         0         1         0         0         0         0         0
    8          0         0         0         0         0         0         0         1         0         0         0         0
    9          0         0         0         0         0         0         0         0         1         0         0         0
    10         0         0         0         0         0         0         0         0         0         1         0         0
    11         0         0         0         0         0         0         0         0         0         0         1         0
    12         0         0         0         0         0         0         0         0         0         0         0         1
    

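    3) listcompr listcompr is another list comprehension package. Note that the development version of this package is needed in order to use bycol=. If a data frame is wanted, replace gen.named.matrix with gen.named.data.frame.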

    # devtools::install_github("patrickroocks/listcompr")
    library(listcompr)
    
    nms <- names(YearCountry)
    gen.named.matrix("{nms[i]}.{nms[j]}", YearCountry[[i]] * YearCountry[[j]],
      i = 2:4, j = 5:8, bycol = TRUE)
    
