Combine factor levels with fuzzy/partial character string match in R: answers

【Question title】: Combine factor levels with fuzzy/partial character string match in R
【Posted】: 2019-11-07 19:43:12
【Question description】:

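I have a dataset containing DMAs (Designated Market Areas), but many DMAs show up as two different levels because the DMA name was truncated, e.g. the DMA "Abilene-Sweetwater, TX" sometimes appears as "Abilene-Sweetw".

Here is a snippet of the dataset: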

dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)

DMA.df <- data.frame(dma, cost)

DMA.df
                          dma cost
1              Abilene-Sweetw 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5              Albany-Schenec 0.32
6              Albany-Schenec 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

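Searching SO and elsewhere turned up solutions that show how to manually combine several factor levels into one. Obviously I don't want to do this by hand.

I'm looking for a way to fix the truncated DMAs and convert them to the "full" DMA (City-...-, State). One saving grace is that the truncation follows a pattern: it cuts off at 14 characters. The solution needs to match on all 14 characters because many DMAs start with the same name (e.g. "Albany, GA" and "Albany-..., NY").

In other words, I need to find every truncated DMA that matches a full DMA and turn the truncated DMA into the full one.

The sample DF should end up looking like this: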

                          dma cost
1      Abilene-Sweetwater, TX 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5 Albany-Schenectady-Troy, NY 0.32
6 Albany-Schenectady-Troy, NY 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

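Thanks in advance for any suggestions.

【Question comments】:

DMA.df$dma2 <- substring(DMA.df$dma, 1, 14) I think you just need to truncate everything... then you have your matches, right?
@cory - Thanks for the suggestion. Truncating every DMA to 14 letters is easy, but I want to convert the truncated DMAs to the "full" DMAs. To clarify, this is a sample DF; the full DF has more than 100M rows and 210 DMAs (it seems every long DMA is duplicated by a truncated match).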
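With 100M rows but only a few hundred distinct names, one option is to repair the factor levels rather than the rows. The following is a minimal sketch under assumptions that are not from the thread: the helper name `expand_truncated` is hypothetical, a name counts as truncated only if some longer name shares its first 14 characters, and if two long names share the same 14-character prefix the first one wins.

```r
# Sample data from the question, stored as a factor
dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX",
         "Albany, GA", "Albany, GA",
         "Albany-Schenec", "Albany-Schenec",
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)
DMA.df <- data.frame(dma = factor(dma), cost)

# Hypothetical helper: map each level to the longer level that shares
# its first `width` characters, or leave it unchanged if there is none.
expand_truncated <- function(lv, width = 14) {
  long <- lv[nchar(lv) > width]        # candidates for "full" names
  key  <- substring(long, 1, width)    # what each full name truncates to
  idx  <- match(lv, key)               # NA when no longer name matches
  ifelse(is.na(idx), lv, long[idx])
}

# Assigning duplicated level names merges those levels in place,
# so only the (short) vector of levels is processed, not every row.
levels(DMA.df$dma) <- expand_truncated(levels(DMA.df$dma))
```

Short names that were never truncated, such as "Albany, GA", pass through untouched because no longer name begins with them.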

Tags: r factors fuzzy-logic levels

【Solution 1】:

The simplest solution uses base R's substring and merge together with dplyr::select and mutate:

library(dplyr) #provides %>%, mutate and select

#sample (and problematic) df with some DMAs truncated and others full-length
dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)


DMA.df <- data.frame(dma, cost, stringsAsFactors = FALSE)
                         dma cost
1              Abilene-Sweetw 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5              Albany-Schenec 0.32
6              Albany-Schenec 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

#create a column where ALL the DMAs are truncated to the same length
DMA.df <- DMA.df %>% 
  mutate(dma_truncated = substring(dma, 1, 13)) %>% 
  select(-dma) #drop the original 'DMA' column
  cost dma_truncated
1 0.46 Abilene-Sweet
2 0.46 Abilene-Sweet
3 0.45    Albany, GA
4 0.45    Albany, GA
5 0.32 Albany-Schene
6 0.32 Albany-Schene
7 0.32 Albany-Schene
8 0.32 Albany-Schene

#Create a lookup table where the truncated DMA is paired with the full DMA
dma_master <- c("Abilene-Sweetwater, TX",  
                "Albany, GA", 
                "Albany-Schenectady-Troy, NY")
dma_truncated <- substring(dma_master, 1, 13)
DMA_lookup.df <- data.frame(dma_truncated, dma_master, stringsAsFactors = FALSE)

  dma_truncated                  dma_master
1 Abilene-Sweet      Abilene-Sweetwater, TX
2    Albany, GA                  Albany, GA
3 Albany-Schene Albany-Schenectady-Troy, NY


#Use MERGE to create the desired column of 'DMA' in the original DF
full_DMA.df <- merge(DMA_lookup.df, DMA.df, by='dma_truncated') %>% 
  select(-dma_truncated) #drop the truncated DMA column

                   dma_master cost
1      Abilene-Sweetwater, TX 0.46
2      Abilene-Sweetwater, TX 0.46
3 Albany-Schenectady-Troy, NY 0.32
4 Albany-Schenectady-Troy, NY 0.32
5 Albany-Schenectady-Troy, NY 0.32
6 Albany-Schenectady-Troy, NY 0.32
7                  Albany, GA 0.45
8                  Albany, GA 0.45

Here is the SO post that essentially solved my problem: How to do vlookup and fill down (like in Excel) in R?
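Note that the merge output above is sorted by the join key, so the rows no longer appear in their original order. If row order matters, a vlookup-style alternative is base R's match, which looks up each row's key in the master list and leaves the data frame otherwise untouched. This is only a sketch: the column name `dma_full` is an assumption for illustration, and it truncates at 14 characters (the width stated in the question) rather than 13.

```r
# Sample data from the question
dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX",
         "Albany, GA", "Albany, GA",
         "Albany-Schenec", "Albany-Schenec",
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)
DMA.df <- data.frame(dma, cost, stringsAsFactors = FALSE)

# Master list of full DMA names and the key each one truncates to
dma_master <- c("Abilene-Sweetwater, TX",
                "Albany, GA",
                "Albany-Schenectady-Troy, NY")
master_key <- substring(dma_master, 1, 14)

# Look up every row's 14-character key in the master keys.
# match() returns NA for a key with no master entry, which makes
# unmatched names easy to spot afterwards.
idx <- match(substring(DMA.df$dma, 1, 14), master_key)
DMA.df$dma_full <- dma_master[idx]  # hypothetical column name
```

Checking `anyNA(DMA.df$dma_full)` afterwards shows whether any name in the data is missing from the master list.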

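【Comments】:

【Solution 2】:

I have published a function on GitHub, xfactor, that uses regular-expression matching to change factor levels and can do the above. Install it with devtools::install_github("jwilliman/xfactor"). The levels argument holds the regular expressions to match (the truncated DMAs), and the labels argument holds the desired output (the full DMA names).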


library(xfactor)

dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)

DMA.df <- data.frame(dma, cost)


 within(DMA.df, {
   dma = xfactor::xfactor(
     dma, 
     levels = c("Abilene", "Albany, GA", "Albany-Schenec"),
     labels = c("Abilene-Sweetwater, TX", "Albany, GA", "Albany-Schenectady-Troy, NY")
   )
 })
#>                           dma cost
#> 1      Abilene-Sweetwater, TX 0.46
#> 2      Abilene-Sweetwater, TX 0.46
#> 3                  Albany, GA 0.45
#> 4                  Albany, GA 0.45
#> 5 Albany-Schenectady-Troy, NY 0.32
#> 6 Albany-Schenectady-Troy, NY 0.32
#> 7 Albany-Schenectady-Troy, NY 0.32
#> 8 Albany-Schenectady-Troy, NY 0.32

Created on 2020-04-18 by the reprex package (v0.3.0)
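For readers who prefer not to install a package from GitHub, the same idea of relabelling factor levels by regular expression can be approximated in base R. The sketch below is not how xfactor is implemented; `relabel_by_regex` is a hypothetical helper written for illustration, and when several patterns match the same level the last matching pattern wins.

```r
# Hypothetical helper: relabel factor levels that match each regex.
relabel_by_regex <- function(x, patterns, labels) {
  stopifnot(length(patterns) == length(labels))
  f <- factor(x)
  old_lv <- levels(f)
  new_lv <- old_lv
  for (i in seq_along(patterns)) {
    # Every original level matching pattern i gets label i
    new_lv[grepl(patterns[i], old_lv)] <- labels[i]
  }
  levels(f) <- new_lv  # duplicated names merge into one level
  f
}

dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX",
         "Albany, GA", "Albany, GA",
         "Albany-Schenec", "Albany-Schenec",
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")

# Patterns are anchored with ^ so they only match at the start of a name
dma_fixed <- relabel_by_regex(
  dma,
  patterns = c("^Abilene", "^Albany, GA", "^Albany-Schenec"),
  labels   = c("Abilene-Sweetwater, TX", "Albany, GA",
               "Albany-Schenectady-Troy, NY")
)
```

As with the xfactor call, the patterns still have to be supplied by hand, so this suits a short list of known problem names better than all 210 DMAs.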

【Comments】: