【Question title】: R Studio: Match first n characters between two columns, and fill in value from another column
【Posted】: 2021-05-04 00:21:38
【Question】:

I have a data frame "city_table" that looks like this:

+---+---------------------+
|   | city                |
+---+---------------------+
| 1 | Chicago-2234dxsw    |
+---+---------------------+
| 2 | Chicago,IL          |
+---+---------------------+
| 3 | Chicago             |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx      |
+---+---------------------+
| 6 | Atlanta- 1234xx     |
+---+---------------------+
| 7 | Atlanta, GA         |
+---+---------------------+
| 8 | Atlanta - 123456T   |
+---+---------------------+

I have another lookup table of city codes, "city_lookup", that looks like this:

+---+--------------+-----------+
|   | city_name    | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL  | 001       |
+---+--------------+-----------+
| 2 | Atlanta, GA  | 002       |
+---+--------------+-----------+

As you can see, the city names in "city" are messy and inconsistently formatted, while the city names in "city_name" follow a uniform format (City, STATE).

I would like the final table to return the correct city code by matching the first n characters (say, n = 7) between city_table$city and city_lookup$city_name, like this:

+---+---------------------+-----------+
|   | city_name           | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw    | 001       |
+---+---------------------+-----------+
| 2 | Chicago,IL          | 001       |
+---+---------------------+-----------+
| 3 | Chicago             | 001       |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001       |
+---+---------------------+-----------+
| 5 | Chicago_2133xx      | 001       |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx     | 002       |
+---+---------------------+-----------+
| 7 | Atlanta, GA         | 002       |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T   | 002       |
+---+---------------------+-----------+

I am doing this in R, preferably with tidyverse/dplyr. Thank you very much for your help!

【Comments】:

    Tags: r dplyr tidyverse


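    【Solution 1】:

    Better still, as long as the characters that follow the full city name are all non-letters, you can match on the whole city name like this: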

    library(dplyr)  # provides %>%, mutate(), left_join(), select(); re-exports tibble()

    city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
    city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                          city_code = c("001", "002"))

    # Use the leading run of letters in each name as the join key
    city_table %>%
      mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
      left_join(city_lookup %>%
                  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name)),
                by = "city_clean") %>%
      select(-city_clean, -city_name)
    
    
      city              city_code
      <chr>             <chr>    
    1 Chicago-2234dxsw  001      
    2 Chicago,IL        001      
    3 Atlanta - 123456T 002 
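
To see what the join key looks like, the regex step can be run on its own. This is only a minimal base R sketch: `leading_letters` is a hypothetical helper name, and "New York, NY" is a made-up input that is not in the question's data. It is included to show a limitation: the pattern stops at the first non-letter, so a multi-word city name collapses to its first word.

```r
# Hypothetical helper (not part of the answer above):
# keep only the leading run of ASCII letters in each string.
leading_letters <- function(x) sub("^([A-Za-z]*).*", "\\1", x)

keys <- leading_letters(c("Chicago-2234dxsw", "Chicago,IL",
                          "Atlanta - 123456T", "New York, NY"))
print(keys)
# "Chicago" "Chicago" "Atlanta" "New"
```

If the real data contains names with spaces, the character class would need to be widened, and that in turn depends on how the trailing codes are separated from the name.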
    

    【Comments】:

      【Solution 2】:

      We can create a column with substring (as the OP asked in the question) and then do a regex_left_join

      library(dplyr)
      library(fuzzyjoin)
      city_table %>%
         mutate(city_sub = substring(city, 1, 7)) %>%
         regex_left_join(city_lookup %>%
                           mutate(city_sub = substring(city_name, 1, 7)), 
                   by = 'city_sub')  %>%
         select(city_name = city, city_code)
      

      -output

      #             city_name city_code
      #1    Chicago-2234dxsw       001
      #2          Chicago,IL       001
      #3             Chicago       001
      #4 Chicago - 124421xsd       001
      #5      Chicago_2133xx       001
      #6     Atlanta- 1234xx       002
      #7         Atlanta, GA       002
      #8   Atlanta - 123456T       002
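
Because both keys here are literal fixed-length prefixes, an exact join on the prefix may be enough, without fuzzyjoin. The following is a sketch under that assumption, using dplyr only; `n`, `key`, and `result` are illustrative names, and the data is a three-row subset of the question's table. One reason to consider it: regex_left_join treats the lookup key as a regular expression, so a prefix containing a metacharacter such as "." could match more than intended.

```r
library(dplyr)

# Subset of the question's data, for illustration only
city_table <- data.frame(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta- 1234xx"))
city_lookup <- data.frame(city_name = c("Chicago, IL", "Atlanta, GA"),
                          city_code = c("001", "002"))

n <- 7  # prefix length suggested in the question

result <- city_table %>%
  mutate(key = substr(city, 1, n)) %>%                      # first n characters of the messy name
  left_join(city_lookup %>%
              mutate(key = substr(city_name, 1, n)),        # first n characters of the clean name
            by = "key") %>%
  select(city, city_code)

print(result)
```

A fixed n only works when every city name is at least n characters long and no two lookup names share the same first n characters; otherwise rows go unmatched or match more than one code.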
      

      data

      city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago", 
      "Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA", 
      "Atlanta - 123456T")), class = "data.frame", row.names = c(NA, 
      -8L))
      
      city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"), 
      city_code = c("001", 
      "002")), class = "data.frame", row.names = c(NA, -2L))
      

      【Comments】:
