Using case_when() to assign two new columns, instead of one

【Question Title】: Using case_when() to assign two new columns, instead of one
【Posted】: 2019-01-28 17:38:45
【Question Description】:

I have this sample data:

df <- tibble(
  "City1" = c("New York", "Boston", "Chicago"),
  "City2" = c("Chicago", "Cleveland", "Atlanta"))

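Assume City1 is the origin and City2 is the destination, i.e. a person travels from New York to Chicago.

I want to add one column for the origin latitude and one for the origin longitude, and do the same for the destination city. In total, I want four new columns. I already have the coordinates.

How can I assign the coordinates? I tried using case_when, but I'm not sure how to pass the coordinates into multiple columns. Doing one column is easy: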

library(tidyverse)

# The numbers after the cities are the latitudes
df <- df %>% 
  mutate(
   City1_lat = case_when(
    City1 == 'New York' ~ 40.7128,
    City1 == 'Boston' ~ 42.3601,
    City1 == 'Chicago' ~ 41.8781
  )
 )

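How can I extend this to also fill a City1_lon column? I'm trying to streamline this as much as possible, since I have several thousand rows of origins/destinations. Either a dplyr or a base solution works. I will then extend it to the destination city, City2. For reference (latitude, longitude):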

New York: 40.7128, 74.0060
Boston: 42.3601, 71.0589
Chicago: 41.8781, 87.6298
Cleveland: 41.4993, 81.6944
Atlanta: 33.7490, 84.3880

【Question Discussion】:

Tags: r dplyr

【Solution 1】:

With your city data in a data frame like this:

> city
       City     lat    long
1  New York 40.7128 74.0060
2    Boston 42.3601 71.0589
3   Chicago 41.8781 87.6298
4 Cleveland 41.4993 81.6944
5   Atlanta 33.7490 84.3880

Use match to look up the city names in the table, extract the latitude and longitude, and then rename:

> setNames(city[match(df$City1, city$City), c("lat","long")],c("City1lat","City1long"))
  City1lat City1long
1  40.7128   74.0060
2  42.3601   71.0589
3  41.8781   87.6298

> setNames(city[match(df$City2, city$City), c("lat","long")],c("City2lat","City2long"))
  City2lat City2long
3  41.8781   87.6298
4  41.4993   81.6944
5  33.7490   84.3880

You can then cbind the results onto your original data:

> df = cbind(df, setNames(city[match(df$City1, city$City), c("lat","long")],c("City1lat","City1long")), setNames(city[match(df$City2, city$City), c("lat","long")],c("City2lat","City2long")))
> df
     City1     City2 City1lat City1long City2lat City2long
1 New York   Chicago  40.7128   74.0060  41.8781   87.6298
2   Boston Cleveland  42.3601   71.0589  41.4993   81.6944
3  Chicago   Atlanta  41.8781   87.6298  33.7490   84.3880
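
The answer shows `city` only as printed output, so here is a minimal base R sketch of the same `match` lookup that can be run end to end. The `city` table is rebuilt from the coordinates listed in the question, and `add_coords()` is a hypothetical helper name introduced for illustration, not something from the original answer.

```r
# Lookup table rebuilt from the coordinates given in the question
city <- data.frame(
  City = c("New York", "Boston", "Chicago", "Cleveland", "Atlanta"),
  lat  = c(40.7128, 42.3601, 41.8781, 41.4993, 33.7490),
  long = c(74.0060, 71.0589, 87.6298, 81.6944, 84.3880),
  stringsAsFactors = FALSE
)

df <- data.frame(
  City1 = c("New York", "Boston", "Chicago"),
  City2 = c("Chicago", "Cleveland", "Atlanta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: appends <col>lat and <col>long columns for one city column
add_coords <- function(data, col, lookup = city) {
  idx <- match(data[[col]], lookup$City)  # NA where the city is not in the table
  data[[paste0(col, "lat")]]  <- lookup$lat[idx]
  data[[paste0(col, "long")]] <- lookup$long[idx]
  data
}

# Apply the lookup to the origin and destination columns in turn
for (col in c("City1", "City2")) {
  df <- add_coords(df, col)
}
df
```

Because `match` returns NA for a city that is missing from the table, unknown cities end up with NA coordinates rather than raising an error, and any other columns in `df` are left untouched.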

【Discussion】:

【Solution 2】:

One option is to do a left_join after creating a "keyval" dataset

library(tidyverse)
map_dfc(names(df), ~  df %>% 
                        select(.x) %>% 
                        left_join(keyval, by = setNames('City', .x))) %>%
    select(names(df), everything())  
# A tibble: 3 x 6
#  City1    City2       lat   lon  lat1  lon1
#  <chr>    <chr>     <dbl> <dbl> <dbl> <dbl>
#1 New York Chicago    40.7  74.0  41.9  87.6
#2 Boston   Cleveland  42.4  71.1  41.5  81.7
#3 Chicago  Atlanta    41.9  87.6  33.7  84.4

If the original data has more columns and we are only interested in the "City" columns, loop over only the "City" columns

df$journeys <- c(100, 200, 300)
nm1 <- grep("City", names(df), value = TRUE)
map_dfc(nm1, ~  df %>% 
                     select(.x) %>% 
                     left_join(keyval, by = setNames('City', .x))) %>%  
      bind_cols(df %>% 
                  select(-one_of(nm1)))

Data

keyval <- structure(list(City = c("New York", "Boston", "Chicago", "Cleveland", 
 "Atlanta"), lat = c(40.7128, 42.3601, 41.8781, 41.4993, 33.749
 ), lon = c(74.006, 71.0589, 87.6298, 81.6944, 84.388)), row.names = c(NA, 
  -5L), class = c("tbl_df", "tbl", "data.frame"))
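
For the two-column case in the question, the same join can also be written out explicitly, which sidesteps the column-selection issue raised in the comments. This is only a sketch under a few assumptions: `keyval` is rebuilt here as a plain tibble from the question's coordinates, `df_coords` is an illustrative name, and the `journeys` column is added just to show that extra columns pass through.

```r
library(dplyr)

# Lookup table: one row per city (values from the question)
keyval <- tibble(
  City = c("New York", "Boston", "Chicago", "Cleveland", "Atlanta"),
  lat  = c(40.7128, 42.3601, 41.8781, 41.4993, 33.7490),
  lon  = c(74.0060, 71.0589, 87.6298, 81.6944, 84.3880)
)

# Sample data plus an unrelated column that must be preserved
df <- tibble(
  City1    = c("New York", "Boston", "Chicago"),
  City2    = c("Chicago", "Cleveland", "Atlanta"),
  journeys = c(100, 200, 300)
)

# Join the lookup once per city column, renaming lat/lon before each join
df_coords <- df %>%
  left_join(rename(keyval, City1_lat = lat, City1_lon = lon),
            by = c("City1" = "City")) %>%
  left_join(rename(keyval, City2_lat = lat, City2_lon = lon),
            by = c("City2" = "City"))

df_coords
```

The lookup table only needs one row per distinct city, however many rows `df` has, and cities missing from `keyval` come back with NA coordinates.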

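【Discussion】:

Does the keyval set have to be the same length as the actual dataset? I'm working with thousands of rows, so obviously a smaller set would be great
@papelr For ease of creation, the state names can be obtained from state.name
This seems to fail if the source data has other columns besides City1 and City2, e.g. df$journeys=c(100,200,300).
Yes, but robust is always better. I haven't figured out how to get your code to work on just the two columns without first dropping all the others.
@Spacedman I guess the part that fails is the last step, select(names(df), everything()); if I remove it and then append the other columns, it should work fine

【Solution 3】:

Here is a tidyverse solution: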

library(dplyr)
library(purrr)

df <- tibble(
  "City1" = c("New York", "Boston", "Chicago"),
  "City2" = c("Chicago", "Cleveland", "Atlanta"))


df <- df %>% 
  mutate(
    City1_coords = case_when(
      City1 == 'New York' ~ list(c(40.7128,74.0060)),
      City1 == 'Boston' ~ list(c(42.3601,71.0589)),
      City1 == 'Chicago' ~ list(c(41.8781,87.6298))
    )
  ) %>% 
  mutate(City1_lat = City1_coords %>% map_dbl(~ .x[1] ),
         City1_lon = City1_coords %>% map_dbl(~ .x[2] ))
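
One way this pattern can fail on real data is when a row contains a city that none of the case_when branches cover: that row then has no coordinate vector, and map_dbl cannot extract a single number from it. The sketch below adds a catch-all branch that returns NA coordinates instead. It assumes that an unmatched city is the cause of such a failure; "Denver" is a hypothetical unmatched value used purely for illustration.

```r
library(dplyr)
library(purrr)

# "Denver" is a hypothetical city with no matching branch below
df <- tibble(City1 = c("New York", "Boston", "Denver"))

df <- df %>%
  mutate(
    City1_coords = case_when(
      City1 == "New York" ~ list(c(40.7128, 74.0060)),
      City1 == "Boston"   ~ list(c(42.3601, 71.0589)),
      # Catch-all: unmatched cities get NA coordinates instead of an empty element
      TRUE                ~ list(c(NA_real_, NA_real_))
    ),
    City1_lat = map_dbl(City1_coords, ~ .x[1]),
    City1_lon = map_dbl(City1_coords, ~ .x[2])
  )

df
```

Rows that come back as NA can then be inspected to find the city names that still need a branch (or a typo that needs fixing).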

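【Discussion】:

I did this and got the following error: Error in mutate_impl(.data, dots) : Evaluation error: Result 125 is not a length 1 atomic vector. Is my variable type wrong? That's all I can think of. The example above works perfectly, but only for that example. Hmm
This is the issue with this answer: github.com/tidyverse/purrr/issues/337
Glad you figured it out!

【Solution 4】:

Here is an approach using mutate_all and unnest, with an extra hack for naming the columns: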

df %>% 
  mutate_all(funs(l = case_when(
      . == 'New York'  ~ list(tibble(at=40.7128, on=74.0060)),
      . == 'Boston'    ~ list(tibble(at=42.3601, on=71.0589)),
      . == 'Chicago'   ~ list(tibble(at=41.8781, on=87.6298)),
      . == 'Cleveland' ~ list(tibble(at=41.4993, on=81.6944)),
      . == 'Atlanta'   ~ list(tibble(at=33.7490, on=84.3880))
    )
  )) %>%
  unnest(.sep = "")

# # A tibble: 3 x 6
#      City1     City2 City1_lat City1_lon City2_lat City2_lon
#      <chr>     <chr>     <dbl>     <dbl>     <dbl>     <dbl>
# 1 New York   Chicago   40.7128   74.0060   41.8781   87.6298
# 2   Boston Cleveland   42.3601   71.0589   41.4993   81.6944
# 3  Chicago   Atlanta   41.8781   87.6298   33.7490   84.3880

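This solves "using case_when() to assign two new columns".

To solve the general problem, I would recommend a solution based on a left join, since keeping the keys and values in a tidy, separate table is more flexible.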

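【Discussion】:

I feel like I'm blind. Where is the hack? But I really like this solution
mutate_all will create columns named City1_l and City2_l, and then unnest, thanks to the argument sep = '', creates the new columns by concatenating those names with the tibble's column names 'at' and 'on'. That's the hack :)
A more conventional approach would be to write funs(gps = ... and tibble(lat=40.7128, lon=74.0060)) etc., and then with sep='_' you would get columns named City1_gps_lat etc.
Ahhh, I see

【Solution 5】:

You should load an external file (called data_xy in my example) that contains the city, latitude and longitude information, and then you can use left_join. Try this code: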

library(dplyr)
library(purrr)
data_xy <- tibble(city = c("New York", "Boston", "Chicago", "Cleveland", "Atlanta"),
                  lat = c(40.7128, 42.3601, 41.8781, 41.4993, 33.7490),
                  lon = c(74.0060, 71.0589, 87.6298, 81.6944, 84.3880))


df <- tibble("City1" = c("New York", "Boston", "Chicago"),
             "City2" = c("Chicago", "Cleveland", "Atlanta"))

df_latlon <- map(names(df), ~ left_join(df %>% select(.x),  data_xy, 
                                        by= structure(names = .x, .Data = "city")) )
df_latlon

Output:

> df_latlon
[[1]]
# A tibble: 3 x 3
  City1      lat   lon
  <chr>    <dbl> <dbl>
1 New York  40.7  74.0
2 Boston    42.4  71.1
3 Chicago   41.9  87.6

[[2]]
# A tibble: 3 x 3
  City2       lat   lon
  <chr>     <dbl> <dbl>
1 Chicago    41.9  87.6
2 Cleveland  41.5  81.7
3 Atlanta    33.7  84.4
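
The result above is a list with one tibble per city column, whereas the question asks for a single table with four new columns. A possible way to get there is to prefix lat and lon with the city column name inside the loop and bind the pieces side by side. This is a sketch rather than part of the original answer: `df_wide` is an illustrative name, and it assumes dplyr 1.0.0 or later for `rename_with()` and `all_of()`.

```r
library(dplyr)
library(purrr)

data_xy <- tibble(city = c("New York", "Boston", "Chicago", "Cleveland", "Atlanta"),
                  lat = c(40.7128, 42.3601, 41.8781, 41.4993, 33.7490),
                  lon = c(74.0060, 71.0589, 87.6298, 81.6944, 84.3880))

df <- tibble("City1" = c("New York", "Boston", "Chicago"),
             "City2" = c("Chicago", "Cleveland", "Atlanta"))

df_wide <- names(df) %>%
  map(function(col) {
    df %>%
      select(all_of(col)) %>%
      # Join on the current city column against data_xy$city
      left_join(data_xy, by = setNames("city", col)) %>%
      # Prefix the coordinate columns, e.g. lat -> City1_lat
      rename_with(~ paste(col, .x, sep = "_"), .cols = c(lat, lon))
  }) %>%
  bind_cols()

df_wide
```

Binding by position is safe here because a left join against a lookup table with unique city names keeps the rows of `df` in their original order.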

【Discussion】: