【Question title】: Text String Isolation and Transformation in R
【Posted】: 2016-07-12 20:01:22
【Question description】:

This df1 data frame closely resembles what I work with in real life (two columns):

df1 <- data.frame(provider = c("LeBron James, MD",
                          "Peyton Manning, DDS",
                          "Mike Trout, DO"),
             cpt_codes = c("This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group",
                           "Overutilization of visits per patient for E0781-RR-59 and J1100!",
                           "High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%"))

print(df1)
#             provider                                                                                       cpt_codes
#1    LeBron James, MD    This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group
#2 Peyton Manning, DDS                                Overutilization of visits per patient for E0781-RR-59 and J1100!
#3      Mike Trout, DO High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%

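From the cpt_codes field I need to extract every block that is exactly 5 alphanumeric characters long and ends in a digit (0-9). I then need to match those blocks to the provider field, with one unique row per provider/cpt_code combination. The final result should look like this: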

#             provider cpt_codes
#1    LeBron James, MD     99284
#2    LeBron James, MD     99282
#3    LeBron James, MD     99285
#4 Peyton Manning, DDS     E0781
#5 Peyton Manning, DDS     J1100
#6      Mike Trout, DO     29581
#7      Mike Trout, DO     93990

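Through research I found some very good Stack Overflow questions and answers about text strings in R, which let me piece together the solution below. It gets me what I want, but it seems overly complicated. I'm curious whether others can produce the "final" output more concisely.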

library(stringr)
#replace all punctuation with spaces in the text strings
df1$cpt_codes <- str_replace_all(df1$cpt_codes, "[[:punct:]]", " ")

#identifies all 5 character blocks in the text strings
t <- str_extract_all(df1$cpt_codes, "\\b[a-zA-Z0-9]{5,5}\\b")

#makes a new data frame that keeps only the 5 character blocks ending in a numeric char
fn <- c(0:9)
cpts <- function(x) {
  t1 <- subset(t[[x]], grepl(paste(fn, collapse = "|"), substr(t[[x]], 5, 5)) == TRUE)
  data.frame(id = rep(x, length(t1)), cpt_codes = t1)
}
t2 <- do.call("rbind", (lapply(c(1:length(t)), function(x) cpts(x))))

#creates an "id" field on the df1
df1$id <- c(1:nrow(df1))
df3 <- df1[, -2]

final <- merge(df3, t2, by = "id")
#drop the helper "id" column (the result must be assigned, otherwise "id" stays in final)
final <- final[, -1]

print(final)
#            provider cpt_codes
#1    LeBron James, MD     99284
#2    LeBron James, MD     99282
#3    LeBron James, MD     99285
#4 Peyton Manning, DDS     E0781
#5 Peyton Manning, DDS     J1100
#6      Mike Trout, DO     29581
#7      Mike Trout, DO     93990
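
As a side note on the filtering step above: testing the fifth character against `0|1|...|9` can probably be replaced by a single pattern anchored at the end of the string. A minimal sketch, assuming a hypothetical vector `blocks` that holds the 5-character blocks extracted from the first row:

```r
# Hypothetical input: the 5-character blocks found in the first row of df1
blocks <- c("bills", "codes", "99284", "99282", "99285", "group")

# "[0-9]$" is TRUE only when the last character is a digit
kept <- blocks[grepl("[0-9]$", blocks)]
print(kept)
```

This keeps only the three numeric codes and avoids both `substr()` and the pasted alternation.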

【Question discussion】:

    Tags: regex r text


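    【Solution 1】:

    You can try the regular expression \\b\\w{4}\\d\\b. I also think [[:punct:]] characters act as a kind of word boundary, so you don't have to replace them with spaces.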

    library(dplyr); library(tidyr); library(stringr)
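    # name the list column explicitly; recent tidyr versions warn when unnest() is called without it
    df1 %>% mutate(cpt_codes = str_extract_all(cpt_codes, "\\b\\w{4}\\d\\b")) %>% unnest(cpt_codes)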
    
    #              provider cpt_codes
    # 1    LeBron James, MD     99284
    # 2    LeBron James, MD     99282
    # 3    LeBron James, MD     99285
    # 4 Peyton Manning, DDS     E0781
    # 5 Peyton Manning, DDS     J1100
    # 6      Mike Trout, DO     29581
    # 7      Mike Trout, DO     93990
    

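    【Discussion】:

    • The metacharacter \\w also matches underscores; [a-zA-Z0-9] is probably the safest choice.
    • @Psidom. This is about as concise as it gets. I'm struggling to find documentation on "\\w" (although with your function it is fairly intuitive what is going on).
    • @PierreLafortune Thanks.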
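
To illustrate the underscore point raised in the comments, here is a small base R sketch. The input string `x` is made up for the demonstration (it is not from the question's data), and `perl = TRUE` is used so that `\\w` and `\\d` follow PCRE semantics:

```r
# Hypothetical test string containing an underscore token "AB_12"
x <- "codes AB_12 and 99284, plus E0781-RR-59"

# \\w accepts letters, digits and the underscore
w_matches <- regmatches(x, gregexpr("\\b\\w{4}\\d\\b", x, perl = TRUE))[[1]]

# An explicit character class rejects the underscore
strict_matches <- regmatches(
  x, gregexpr("\\b[a-zA-Z0-9]{4}[0-9]\\b", x, perl = TRUE)
)[[1]]

print(w_matches)
print(strict_matches)
```

With `\\w` the token `AB_12` is picked up alongside the real codes, while the explicit class returns only `99284` and `E0781`. Whether that matters depends on whether underscores can occur in the real text.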
    【Solution 2】:

    This can be done in base R using gregexpr() and regmatches(), as follows:

    cn <- 'cpt_codes';
    # \\b on both sides keeps a match from starting or ending inside a longer token
    m <- regmatches(df1[[cn]],gregexpr('\\b[a-zA-Z0-9]{4}[0-9]\\b',as.character(df1[[cn]])));
    res <- df1[rep(seq_along(m),lengths(m)),setdiff(names(df1),cn),drop=FALSE];
    res[[cn]] <- unlist(m);
    res;
    ##                provider cpt_codes
    ## 1      LeBron James, MD     99284
    ## 1.1    LeBron James, MD     99282
    ## 1.2    LeBron James, MD     99285
    ## 2   Peyton Manning, DDS     E0781
    ## 2.1 Peyton Manning, DDS     J1100
    ## 3        Mike Trout, DO     29581
    ## 3.1      Mike Trout, DO     93990
    
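
The same base R idea can be wrapped in a small function. This is only a sketch: `extract_codes`, its arguments and the default pattern are hypothetical names chosen for illustration, and the data frame is a shortened copy of the question's `df1`.

```r
# Hypothetical helper: one output row per (other columns, matched code) pair
extract_codes <- function(df, text_col,
                          pattern = "\\b[a-zA-Z0-9]{4}[0-9]\\b") {
  txt <- as.character(df[[text_col]])
  m <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
  # repeat each source row once per match, dropping the text column
  out <- df[rep(seq_along(m), lengths(m)), setdiff(names(df), text_col),
            drop = FALSE]
  out[[text_col]] <- unlist(m)
  rownames(out) <- NULL  # replace row names such as 1.1, 1.2 with 1..n
  out
}

df1 <- data.frame(
  provider = c("LeBron James, MD", "Peyton Manning, DDS", "Mike Trout, DO"),
  cpt_codes = c("bills CPT codes 99284, 99282 and 99285 65% more",
                "visits per patient for E0781-RR-59 and J1100!",
                "for the following:29581: 146.88% 93990: 33.71%"),
  stringsAsFactors = FALSE
)

res <- extract_codes(df1, "cpt_codes")
print(res)
```

Resetting the row names is a design choice: it gives sequential row numbers like those in the question's desired output, at the cost of losing the link back to the source row that names such as 1.1 and 1.2 provide.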

    【Discussion】:

      【Solution 3】:

      A data.table solution:

      df1 <- data.frame(provider = c("LeBron James, MD",
                                     "Peyton Manning, DDS",
                                     "Mike Trout, DO"),
                        cpt_codes = c("This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group",
                                      "Overutilization of visits per patient for E0781-RR-59 and J1100!",
                                      "High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%"))
      
      
      
      library(data.table)
      library(stringr)
      ddt <- as.data.table(df1)
      ddt[, str_extract_all(cpt_codes, "\\b\\w{4}\\d\\b"), provider]
      #               provider    V1
      # 1:    LeBron James, MD 99284
      # 2:    LeBron James, MD 99282
      # 3:    LeBron James, MD 99285
      # 4: Peyton Manning, DDS E0781
      # 5: Peyton Manning, DDS J1100
      # 6:      Mike Trout, DO 29581
      # 7:      Mike Trout, DO 93990
      

      【Discussion】:
