【Question title】: Conditionally extracting multiple substrings using grepl specifically, indexing extractions to return cumulative value
【Posted】: 2018-05-20 15:20:32
【Question】:

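Good day, everyone,

I have been unable to complete this challenging task, so I am looking for an elegant way to:

  1. Apply an adaptable method, such as a loop, to each row element in "zone"
  2. Extract multiple substrings row by row from "country_name", grouped by the "zone" element
  3. Store the row-wise substrings as index values to use against df2
  4. Match the index values against the data frame df2
  5. Compute the total population and add it to df1 with mutate

The essential challenge is that the method must not be hard-coded to any specific element in the data frames.

First data frame: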

zone = c("M", "N", "O")
country_name = c("The USA, Canada & Mexico are part of North America", "Canada like Australia is a Commonwealth member", "The UK is still finalizing its exit plans from the EU")
df1 <- data.frame(zone, country_name)

Second data frame:

zonal_region = c("M", "M", "M", "N", "N", "N", "O", "O", "O")
country = c("USA", "Canada", "Mexico", "Canada", "Australia", "UK", "Australia", "UK", "Canada")
population = c(323.4, 36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29)
df2 <- data.frame(zonal_region, country, population)

This is what my final output should look like:

zone = c("M", "N", "O")
country_name = c("The USA, Canada & Mexico are part of North America", "Canada like Australia is a Commonwealth member", "The UK is still finalizing its exit plans from the EU")
total_population = c(487.19, 60.42, 65.64)
df3 <- data.frame(zone, country_name, total_population)

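I am having trouble extracting the multiple substrings and looking up their values in df2 for a given zone.

I would greatly appreciate it if someone could solve this.

Thanks!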

【Comments】:

    Tags: r


    【Solution 1】:

    You can try fuzzyjoin

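    library(dplyr)
    library(stringr)
    library(fuzzyjoin)
    
    df1 %>% 
      # Work with character columns rather than factors
      mutate_if(is.factor, as.character) %>%
      # Keep df2 rows whose zonal_region is detected in zone
      # and whose country is detected in country_name
      fuzzy_left_join((df2 %>% mutate_if(is.factor, as.character)),
                      by = c("zone" = "zonal_region", "country_name" = "country"), 
                      match_fun = str_detect) %>%
      # One total per zone and sentence
      group_by(zone, country_name) %>%
      summarise(total_population = sum(population)) %>%
      data.frame()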
    

    Output:

      zone                                          country_name total_population
    1    M    The USA, Canada & Mexico are part of North America           487.19
    2    N        Canada like Australia is a Commonwealth member            60.42
    3    O The UK is still finalizing its exit plans from the EU            65.64
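
    For comparison, since the question title asks for grepl specifically, the same matching rule can be sketched in base R. This is not part of the original answer; it is a minimal sketch that assumes the sample data is held as plain character columns and that each country name should be matched literally (fixed = TRUE). The names pairs, in_text and matched are illustrative only.

```r
# Sample data rebuilt with character columns instead of factors.
df1 <- data.frame(
  zone = c("M", "N", "O"),
  country_name = c("The USA, Canada & Mexico are part of North America",
                   "Canada like Australia is a Commonwealth member",
                   "The UK is still finalizing its exit plans from the EU"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  zonal_region = c("M", "M", "M", "N", "N", "N", "O", "O", "O"),
  country = c("USA", "Canada", "Mexico", "Canada", "Australia", "UK",
              "Australia", "UK", "Canada"),
  population = c(323.4, 36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29),
  stringsAsFactors = FALSE
)

# Cross join: every row of df1 paired with every row of df2.
pairs <- merge(df1, df2, by = NULL)

# For each pair, test whether the country occurs literally in the sentence.
in_text <- mapply(grepl, pairs$country, pairs$country_name,
                  MoreArgs = list(fixed = TRUE))

# Keep a pair only if the zones agree and the country was found.
matched <- pairs[pairs$zone == pairs$zonal_region & in_text, ]

# Sum the matched populations per zone and sentence.
df3 <- aggregate(population ~ zone + country_name, data = matched, FUN = sum)
names(df3)[names(df3) == "population"] <- "total_population"
df3 <- df3[order(df3$zone), ]
print(df3)
```

    Unlike fuzzy_left_join, this sketch drops any zone that has no matching country, because unmatched pairs are filtered out before aggregating. The cross join also grows as nrow(df1) * nrow(df2), so it is only reasonable for small tables.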
    

    Sample data:

    df1 <- structure(list(zone = structure(1:3, .Label = c("M", "N", "O"
    ), class = "factor"), country_name = structure(c(3L, 1L, 2L), .Label = c("Canada like Australia is a Commonwealth member", 
    "The UK is still finalizing its exit plans from the EU", "The USA, Canada & Mexico are part of North America"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
    -3L))
    
    df2 <- structure(list(zonal_region = structure(c(1L, 1L, 1L, 2L, 2L, 
    2L, 3L, 3L, 3L), .Label = c("M", "N", "O"), class = "factor"), 
        country = structure(c(5L, 2L, 3L, 2L, 1L, 4L, 1L, 4L, 2L), .Label = c("Australia", 
        "Canada", "Mexico", "UK", "USA"), class = "factor"), population = c(323.4, 
        36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29)), class = "data.frame", row.names = c(NA, 
    -9L))
    

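    【Comments】:

    • I used the approach from this solution for my current work problem, and it works with a fuzzy join rather than str_extract_all from stringr; this makes it highly adjustable and suitable for the other changes I have to make
    【Solution 2】:

    We can do this by extracting the 'country' from the 'country_name' column of 'df1', then performing left/right joins on the two datasets followed by a group_by sum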

    library(tidyverse)
    # All country names known to df2, used to build an alternation pattern
    un1 <- unique(df2$country)
    df1 %>%
       # List column: every known country found in each sentence
       mutate(cntry = str_extract_all(country_name, paste(un1, collapse = "|"))) %>% 
       # Attach each zone's candidate countries and populations
       right_join(df2, by = c('zone' = 'zonal_region')) %>% 
       group_by(zone) %>% 
       # With one df1 row per zone, cntry is the same for every row in the group
       summarize(total_population = sum(population[country %in% cntry[[1]]])) %>% 
       left_join(df1, by = "zone") %>%
       select(zone, country_name, total_population)
    # A tibble: 3 x 3
    #  zone  country_name                                          total_population
    #  <fct> <fct>                                                            <dbl>
    #1 M     The USA, Canada & Mexico are part of North America               487. 
    #2 N     Canada like Australia is a Commonwealth member                    60.4
    #3 O     The UK is still finalizing its exit plans from the EU             65.6
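
    To make the extraction step easier to inspect on its own, here is a small base R sketch of what the alternation pattern pulls out of each sentence. It is an illustration rather than part of the original answer: it uses regmatches and gregexpr in place of str_extract_all, and the word-boundary variant at the end is an added assumption about how one might avoid matching a country name inside a longer word.

```r
country_name <- c("The USA, Canada & Mexico are part of North America",
                  "Canada like Australia is a Commonwealth member",
                  "The UK is still finalizing its exit plans from the EU")
countries <- c("USA", "Canada", "Mexico", "Australia", "UK")

# "USA|Canada|Mexico|Australia|UK": match any one of the known countries.
pattern <- paste(countries, collapse = "|")

# One character vector of matches per sentence, returned as a list.
found <- regmatches(country_name, gregexpr(pattern, country_name))
print(found)

# Hypothetical stricter variant: require word boundaries around each name.
strict_pattern <- paste0("\\b(", pattern, ")\\b")
found_strict <- regmatches(country_name,
                           gregexpr(strict_pattern, country_name, perl = TRUE))
```

    On the sample sentences both patterns return the same matches; the boundary version only matters if the text could contain a country name as part of a longer word.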
    

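    【Comments】:

    • I used this approach too, and it also works well for the problem above, but for some reason stringr seems less adaptable than the fuzzy join. This solution is also correct and effective, but I prefer the fuzzy join solution because of its greater adaptability!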