【Question title】:R partial string matching and return value (in R)
【Posted】:2017-01-04 19:50:24
【Question】:

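I have several purchasing databases on which I need to run a list of "keywords" that I built to identify certain products. When there is a match, I want to flag those products with a surgical category.

Here is an example.

Purchasing database (in reality I have more than 2 million rows to go through):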

d<-data.frame(prod_desc=c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM","SLING MALE SYSTEM","DIACHILON","AIGUILLE","GANT","LABEL","CRAYON"),label=1:10)

List of keywords and return values (the real list is much longer):

kw<-data.frame(kw=c("bandelette","tvt","bande transvaginale","sling system","argus"),category="ss_bandelette")

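I want to find the products in prod_desc that contain one of my keyword strings kw. If there is a match, I want to add a column to the d data frame that returns the associated category from the kw data frame.

Right now I can get the expected result with the following code: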

d$cat <- grepl(paste(kw$kw, collapse = "|"), d$prod_desc, ignore.case = TRUE)
d$match <- ifelse(d$cat, "SS_Bandelette", "-")

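But this code is not efficient, because I have about 350 keywords that map to about 30 different categories. What code could I use to automatically return the category in the d data frame whenever one of my keywords is triggered?

Thank you very much for your help.

Phil

【Question comments】:

  • @DarshanBaral I think this question is different. I thought the same at first. I have posted an answer.

Tags: r string match return-value product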


【Solution 1】:
# Convert the descriptions to lowercase so matching is case-insensitive
d$prod_desc <- tolower(d$prod_desc)
# Build a logical table: one row per product in 'd', one column per keyword
m <- data.frame(sapply(kw$kw, grepl, d$prod_desc))
colnames(m) <- kw$kw

# Add a column to 'd' with the first matching keyword (NA if none matches)
d$kw <- apply(m, 1, function(x) names(x)[which(x)[1]])
# Left join on the keyword to bring in its category
merge(d, kw, by = "kw", all.x = TRUE)

#           kw                          prod_desc label      category
#1  bandelette bandelette d'analyse pour glycemie     3 ss_bandelette
#2  bandelette diach. bandelette ster 19mm x 72mm     4 ss_bandelette
#3  bandelette        bandelette tvto-obtryx halo     1 ss_bandelette
#4  bandelette       bandelette mini arc precises     2 ss_bandelette
#5        <NA>                  sling male system     5          <NA>
#6        <NA>                          diachilon     6          <NA>
#7        <NA>                           aiguille     7          <NA>
#8        <NA>                               gant     8          <NA>
#9        <NA>                              label     9          <NA>
#10       <NA>                             crayon    10          <NA>
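
With about 350 keywords and 30 categories, one option is to collapse the keywords of each category into a single alternation pattern, so that `grepl` runs once per category rather than once per keyword. The following is only a sketch under stated assumptions: the `demo_other` category and its keywords are invented for illustration, the first matching category wins, and the keywords are assumed to contain no regex metacharacters.

```r
# Toy data; the "demo_other" category is hypothetical and not part of the question
d <- data.frame(
  prod_desc = c("BANDELETTE TVTO-OBTRYX HALO", "SLING MALE SYSTEM",
                "AIGUILLE", "GANT", "CRAYON"),
  label = 1:5,
  stringsAsFactors = FALSE
)
kw <- data.frame(
  kw = c("bandelette", "tvt", "aiguille", "gant"),
  category = c("ss_bandelette", "ss_bandelette", "demo_other", "demo_other"),
  stringsAsFactors = FALSE
)

# Group keywords by category, keeping the order in which categories first appear
groups <- split(kw$kw, factor(kw$category, levels = unique(kw$category)))
# One "a|b|c" pattern per category
patterns <- vapply(groups, paste, character(1), collapse = "|")

# Assign the first category whose pattern matches; rows with no match stay NA
d$category <- NA_character_
for (cat_name in names(patterns)) {
  hit <- is.na(d$category) &
    grepl(patterns[[cat_name]], d$prod_desc, ignore.case = TRUE)
  d$category[hit] <- cat_name
}
d
```

If a product can legitimately belong to several categories, the `is.na(d$category)` guard would need to be replaced by something that collects all hits; that choice depends on the real data.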

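【Comments】:

  • @PhilippeLachapelle I'm not sure I got it right. Could you test it with your real data and let me know?
  • Joel, thank you very much. It works very well. Thanks for the quick reply. I am now going to study your code! Regards. PL
  • @PhilippeLachapelle Would you mind going through this: stackoverflow.com/help/someone-answers
【Solution 2】: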
# Create the data frame as in the original question
d<-data.frame(prod_desc=c("BANDELETTE TVTO-OBTRYX HALO", "BANDELETTE MINI ARC PRECISES", "BANDELETTE D'ANALYSE POUR GLYCEMIE", "DIACH. BANDELETTE STER 19MM X 72MM","SLING MALE SYSTEM","DIACHILON","AIGUILLE","GANT","LABEL","CRAYON"),label=1:10)
# Create the keywords as in the original question
kw<-data.frame(kw=c("bandelette","tvt","bande transvaginale","sling system","argus"),category="ss_bandelette")
# This assumes you want to match on word boundaries. Without them,
# "BANDELETTE TVTO-OBTRYX HALO" would also match "tvt", for instance.
kw$kw <- paste0("\\b",kw$kw,"\\b")

# Logical matrix: one row per product, one column per keyword pattern
x <- sapply(kw$kw, function(x) grepl(tolower(x), tolower(d$prod_desc)))
# Keep the first matching pattern per row (NA when nothing matches)
d$Match <- apply(x, 1, function(i) names(i)[which(i)[1]])
# Look up the category that belongs to the matched pattern
d$Match <- kw$category[match(d$Match,kw$kw)]
d
#                             prod_desc label         Match
# 1         BANDELETTE TVTO-OBTRYX HALO     1 ss_bandelette
# 2        BANDELETTE MINI ARC PRECISES     2 ss_bandelette
# 3  BANDELETTE D'ANALYSE POUR GLYCEMIE     3 ss_bandelette
# 4  DIACH. BANDELETTE STER 19MM X 72MM     4 ss_bandelette
# 5                   SLING MALE SYSTEM     5          <NA>
# 6                           DIACHILON     6          <NA>
# 7                            AIGUILLE     7          <NA>
# 8                                GANT     8          <NA>
# 9                               LABEL     9          <NA>
# 10                             CRAYON    10          <NA>
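
The effect of the `\\b` word boundaries can be checked in isolation. This is a small sketch: the description "tvt secur" is a made-up example, and the last pattern (allowing other words between "sling" and "system") is only one possible way to catch a row such as "SLING MALE SYSTEM", which neither answer matches with the literal keyword "sling system".

```r
# Lowercased sample descriptions; "tvt secur" is hypothetical
desc <- c("bandelette tvto-obtryx halo", "tvt secur", "sling male system")

# Plain substring match: "tvt" is also found inside "tvto"
plain <- grepl("tvt", desc)

# Word-boundary match: "tvt" must stand alone as a word
bounded <- grepl("\\btvt\\b", desc)

# Assumed variant: "sling" and "system" as whole words, anything in between
gapped <- grepl("\\bsling\\b.*\\bsystem\\b", desc)

data.frame(desc, plain, bounded, gapped)
```

Whether a keyword should match inside longer words is a decision about the data, so it may make sense to store that choice per keyword rather than applying `\\b` to all of them.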
