
【Question title】: Appending data from one data frame onto another data frame in R (or Stata)
【Posted】: 2020-08-20 04:16:51
【Question】:

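I am currently working in R, but I could also solve this in Stata, given some help.

I have two very large datasets. One contains households and their locations; the other contains weather data by date and location. I ultimately need a dataset in which each row is a household and carries the weather data matched to that household by location. In this dataset, each weather column would be identified by the date of the observation.

To keep things simple, I have created three example data frames in R.

The first mimics my household data: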

  house.id location.id
1    10001           a
2    10002           b
3    10003           c
4    10004           c
5    10005           a

The second mimics my weather data:

        date location.id temperature
1 2020-01-01           a          70
2 2020-01-01           b          71
3 2020-01-01           c          74
4 2020-01-02           a          61
5 2020-01-02           b          63
6 2020-01-02           c          61
7 2020-01-03           a          57
8 2020-01-03           b          50
9 2020-01-03           c          64

The last one shows what my end goal is:

  house.id location.id 2020-01-01 2020-01-02 2020-01-03
1    10001           a         70         61         57
2    10002           b         71         63         50
3    10003           c         74         61         64
4    10004           c         74         61         64
5    10005           a         70         61         57

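As you can see, each household pulls in the weather data for its location ID, appended as additional columns named after the dates taken from the second dataset.

Obviously I built this third dataset by hand, otherwise I would not be here asking for code. I need to work out how to generate the third dataset automatically from the first two, so that I can run the same process on the two much larger datasets.

Any help would be greatly appreciated!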

【Comments】:

In Stata this is essentially a merge on the location identifier. From a Stata point of view, a wide layout is likely to be a poor choice for most purposes. Stay long.
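For comparison, here is a minimal sketch of the "stay long" approach in base R, assuming the toy data from the question. The names `households`, `weather`, and `long` are made up for this illustration. The merge alone gives one row per household per date, with no reshaping step.

```r
# Toy data copied from the question (object names are illustrative)
households <- data.frame(
  house.id    = 10001:10005,
  location.id = c("a", "b", "c", "c", "a")
)
weather <- data.frame(
  date        = rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 3),
  location.id = rep(c("a", "b", "c"), times = 3),
  temperature = c(70, 71, 74, 61, 63, 61, 57, 50, 64)
)

# Merge on the location identifier: every household picks up
# one row for each date recorded at its location
long <- merge(households, weather, by = "location.id")

# Sort by household, then by date, for readability
long <- long[order(long$house.id, long$date), ]
```

Whether the long or the wide layout is preferable depends on what is done with the data afterwards; most modelling and plotting functions expect the long form.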

Tags: r dataframe merge append stata

【Solution 1】:

First you need to reshape the weather data to wide format.

With data.table, that looks like this:

library(data.table)
dd <- setDT(dd)
dd <- dcast(dd, location.id ~ date, value.var="temperature")

Or, using base R:

dd <- reshape(dd, direction = "wide", idvar = "location.id", timevar = "date")
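One detail worth noting about the base R route: `reshape()` prefixes each new column with the name of the value variable, so the columns come out as `temperature.<date>` rather than as bare dates. The sketch below, which assumes the toy weather data from the question and uses the illustrative names `weather` and `wide`, strips that prefix so the column names match the `dcast()` result.

```r
# Toy weather data copied from the question (object names are illustrative)
weather <- data.frame(
  date        = rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 3),
  location.id = rep(c("a", "b", "c"), times = 3),
  temperature = c(70, 71, 74, 61, 63, 61, 57, 50, 64)
)

# Long to wide: one row per location, one column per date
wide <- reshape(weather, direction = "wide",
                idvar = "location.id", timevar = "date")

# reshape() names the new columns temperature.<date>;
# drop the prefix so only the date remains
names(wide) <- sub("^temperature\\.", "", names(wide))
```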

Then you can merge:

m <- merge(d, dd, by = "location.id", all.x = TRUE)
m
  location.id house.id 2020-01-01 2020-01-02 2020-01-03
1           a    10001         70         61         57
2           a    10005         70         61         57
3           b    10002         71         63         50
4           c    10003         74         61         64
5           c    10004         74         61         64
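The merged result is sorted by `location.id` and puts that column first, whereas the target table in the question leads with `house.id` and is ordered by household. A possible follow-up step, sketched here on a hand-built copy of the wide weather table (the names `dd_wide`, `unmatched`, and `date_cols` are made up for this illustration), checks for households without weather data and restores the requested layout.

```r
# Household data and an already-widened weather table, copied from the example
d <- data.frame(
  house.id    = 10001:10005,
  location.id = c("a", "b", "c", "c", "a")
)
dd_wide <- data.frame(
  location.id  = c("a", "b", "c"),
  `2020-01-01` = c(70, 71, 74),
  `2020-01-02` = c(61, 63, 61),
  `2020-01-03` = c(57, 50, 64),
  check.names  = FALSE  # keep the date strings as column names
)

m <- merge(d, dd_wide, by = "location.id", all.x = TRUE)

# With all.x = TRUE, households whose location has no weather rows
# are kept with NA in the date columns; list them for inspection
unmatched <- m$house.id[!complete.cases(m)]

# Put house.id first and order the rows by household
date_cols <- setdiff(names(m), c("house.id", "location.id"))
m <- m[order(m$house.id), c("house.id", "location.id", date_cols)]
rownames(m) <- NULL
```

On the real data, a non-empty `unmatched` would suggest location IDs that are coded differently in the two sources.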

Data:

d <- read.table(text = "  house.id location.id
1    10001           a
2    10002           b
3    10003           c
4    10004           c
5    10005           a
                ", header = TRUE)

dd <- read.table(text = "          date location.id temperature
1 2020-01-01           a          70
2 2020-01-01           b          71
3 2020-01-01           c          74
4 2020-01-02           a          61
5 2020-01-02           b          63
6 2020-01-02           c          61
7 2020-01-03           a          57
8 2020-01-03           b          50
9 2020-01-03           c          64
                ", header = TRUE)

【Comments】:

【Solution 2】:

Try it this way:

hh <- structure(list(
  house.id = 10001:10005,
  location.id = structure(c(1L, 2L, 3L, 3L, 1L),
                          .Label = c("a", "b", "c"), class = "factor")
), class = "data.frame", row.names = c(NA, -5L))
temperature <- structure(list(
  date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                   .Label = c("01.01.2020", "02.01.2020", "03.01.2020"), class = "factor"),
  location.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
                          .Label = c("a", "b", "c"), class = "factor"),
  temperature = c(70L, 71L, 74L, 61L, 63L, 61L, 57L, 50L, 64L)
), class = "data.frame", row.names = c(NA, -9L))

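library(tidyverse)
temperature %>% 
  # each weather row is repeated once per household at its location
  left_join(hh, by = "location.id") %>% 
  # name id_cols explicitly rather than passing it by position
  pivot_wider(id_cols = c(house.id, location.id),
              names_from = date,
              values_from = temperature) %>% 
  arrange(house.id)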

【Comments】:

【Solution 3】:

Convert your weather data to wide format and join it to the household data. This should do it:

library(tidyverse)

#set up the household dataset
household_data <-  tribble(~"house.id",~"location.id",
                           10001,"a",
                           10002,"b",
                           10003,"c",
                           10004,"c",
                           10005,"a")
#set up the weather dataset
weather_data <-  tribble(~"date", ~"location.id", ~"temperature",
                         "2020-01-01","a",70,
                         "2020-01-01","b",71,
                         "2020-01-01","c",74,
                         "2020-01-02","a",61,
                         "2020-01-02","b",63,
                         "2020-01-02","c",61,
                         "2020-01-03","a",57,
                         "2020-01-03","b",50,
                         "2020-01-03","c",64)

household_data %>%
  full_join(weather_data %>%
              pivot_wider(names_from = "date",
                          values_from = "temperature"), # converts to wide format
            by = "location.id") # joins the two data frames

# A tibble: 5 x 5
  house.id location.id `2020-01-01` `2020-01-02` `2020-01-03`
     <dbl> <chr>              <dbl>        <dbl>        <dbl>
1    10001 a                     70           61           57
2    10002 b                     71           63           50
3    10003 c                     74           61           64
4    10004 c                     74           61           64
5    10005 a                     70           61           57

But I have no idea how to do this in Stata!

【Comments】: