Merge by partial string match in R: answers

【Question title】: Merge by partial string match in R
【Posted】: 2021-11-19 18:09:54
【Question】:

I have a df as follows:

+-------+---------+-------+
| Brand |  WORD   | Count |
+-------+---------+-------+
| ABC   | cell    |     1 |
| DEF   | dock    |     2 |
| XYZ   | surface |     3 |
| LMN   | pro     |     4 |
| ABC   | mobile  |     5 |
| DEF   | game    |     6 |
| XYZ   | mouse   |     7 |
+-------+---------+-------+

And another one:

+-------+-----------------+--------+
| Brand |      Name       | profit |
+-------+-----------------+--------+
| ABC   | cell game       |     10 |
| ABC   | cellular mobile |     20 |
| DEF   | docking station |     30 |
| XYZ   | surface mouse   |     40 |
| XYZ   | mouse device    |     50 |
| LMN   | pro device      |     60 |
+-------+-----------------+--------+

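I want to merge them by partial string match of WORD against Name (word by word, meaning that cell should match only cell and not cellular), grouped by Brand, so that the resulting table looks like this: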

+-------+---------------+-----------------+-------+--------+
| Brand |     WORD      |      Name       | Count | profit |
+-------+---------------+-----------------+-------+--------+
| ABC   | cell          | cell game       |     1 |     10 |
| ABC   | mobile        | cellular mobile |     5 |     20 |
| XYZ   | surface mouse | surface mouse   |     3 |     40 |
| XYZ   | mouse         | mouse device    |     7 |     50 |
| XYZ   | mouse         | mouse device    |     7 |     50 |
| LMN   | pro           | pro device      |     4 |     60 |
+-------+---------------+-----------------+-------+--------+

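I tried the solution from here: R partial string matching and return value (in R)

But it matches even part of a word, e.g. cell is matched with cellular. I would like to know whether there is a way to match whole words only and get the result in the required form.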

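【Comments on the question】:

This is going to be tricky. You will have to define a whole set of new rules. For example, why shouldn't surface mouse join with mouse device? Both contain the word mouse. To a human it is obvious why you want surface mouse to join with surface mouse, but I don't see why you would not also want it to join with mouse device.
In a similar situation, my solution at the time was to first "clean" the Name column to remove the off-target instances that could occur. For the off-target example you give, you could do something like df2$Name = gsub("cellular mobile", "mobile", df2$Name). Not perfect, but if you don't have many off-target partial matches, a little data inspection is enough to make this work for you.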
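To illustrate the off-target match discussed in these comments, here is a minimal sketch in base R. The vector name `name` is made up for the illustration; its two values are copied from the Name column in the question. It contrasts a plain substring search with a whole-word search based on the regex word boundary `\b`, which is one possible alternative to cleaning the column by hand.

```r
# Two values copied from the question's Name column
# ('name' is a hypothetical variable used only in this sketch).
name <- c("cell game", "cellular mobile")

# Plain substring search: "cell" is also found inside "cellular".
substring_hit <- grepl("cell", name, fixed = TRUE)

# Whole-word search: \\b marks a word boundary on each side of "cell".
word_hit <- grepl("\\bcell\\b", name, perl = TRUE)

print(substring_hit)  # TRUE  TRUE
print(word_hit)       # TRUE FALSE
```

Note that a regex pattern built from arbitrary words would need any special characters escaped first; the words in the question are plain letters, so that is not an issue here.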

Tags: r string dplyr fuzzywuzzy

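【Solution 1】:

We assume here that you want to match on the Brand column and match the WORD column against the Name column, and that the output is to be sorted by profit. The output shown in the question has a duplicated row, which we assume is an error. The inputs d1 and d2 are shown reproducibly in the Note at the end.

We put a space on both sides of Name and on both sides of WORD to ensure that only whole words match. The % used in the like pattern is a wildcard that matches any string of zero or more characters.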

library(sqldf)

sqldf("select d1.Brand, d2.Name, d1.WORD, d1.Count, d2.profit
  from d1
  join d2 on d1.Brand = d2.Brand and 
             ' ' || d2.Name || ' ' like '% ' || d1.WORD || ' %'
  order by d2.profit")

giving:

  Brand            Name    WORD Count profit
1   ABC       cell game    cell     1     10
2   ABC cellular mobile  mobile     5     20
3   XYZ   surface mouse surface     3     40
4   XYZ   surface mouse   mouse     7     40
5   XYZ    mouse device   mouse     7     50
6   LMN      pro device     pro     4     60
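For readers who prefer not to go through SQL, the same space-padding idea can be sketched in base R. This is only an illustration under the same assumptions as above, not part of the original answer: `merge` pairs rows that share a Brand, and a literal (`fixed = TRUE`) search for `" WORD "` inside `" Name "` keeps only whole-word matches. The names `m`, `is_word` and `res` are made up for the sketch, and the inputs are rebuilt inline so the block runs on its own.

```r
# Inputs with the same values as d1 and d2 in the Note.
d1 <- data.frame(
  Brand = c("ABC", "DEF", "XYZ", "LMN", "ABC", "DEF", "XYZ"),
  WORD  = c("cell", "dock", "surface", "pro", "mobile", "game", "mouse"),
  Count = c(1, 2, 3, 4, 5, 6, 7)
)
d2 <- data.frame(
  Brand  = c("ABC", "ABC", "DEF", "XYZ", "XYZ", "LMN"),
  Name   = c("cell game", "cellular mobile", "docking station",
             "surface mouse", "mouse device", "pro device"),
  profit = c(10, 20, 30, 40, 50, 60)
)

# Pair every d1 row with every d2 row of the same Brand.
m <- merge(d1, d2, by = "Brand")

# Whole-word test, row by row: pad both strings with spaces and look for
# " WORD " as a literal substring of " Name ".
is_word <- mapply(
  function(word, name) {
    grepl(paste0(" ", word, " "), paste0(" ", name, " "), fixed = TRUE)
  },
  m$WORD, m$Name,
  USE.NAMES = FALSE
)

# Keep the matching pairs and sort by profit, as in the SQL version.
res <- m[is_word, c("Brand", "Name", "WORD", "Count", "profit")]
res <- res[order(res$profit), ]
rownames(res) <- NULL
print(res)
```

The `mapply` call is needed because `grepl` is not vectorised over its pattern argument. Since `merge` first builds all same-Brand pairs, this sketch suits small inputs such as the ones here.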

Note

The input in reproducible form.

d1 <-
structure(list(Brand = c("ABC", "DEF", "XYZ", "LMN", "ABC", "DEF", 
"XYZ"), WORD = c("cell", "dock", "surface", "pro", "mobile", 
"game", "mouse"), Count = c(1, 2, 3, 4, 5, 6, 7)), class = "data.frame", row.names = c(NA, 
-7L))

d2 <-
structure(list(Brand = c("ABC", "ABC", "DEF", "XYZ", "XYZ", "LMN"
), Name = c("cell game", "cellular mobile", "docking station", 
"surface mouse", "mouse device", "pro device"), profit = c(10, 
20, 30, 40, 50, 60)), class = "data.frame", row.names = c(NA, 
-6L))

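【Comments】:

Hi, in your analysis d2 is actually the final result; it is not the second data frame that I want to match against.
OK. It has been fixed. Note that the input should be shown in the question as the output of dput(X), where X is the input. See the top of the r tag page. I have done it for you this time.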
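As a small illustration of the dput(X) advice, the sketch below uses a made-up two-row data frame `X` (values taken from the first two rows of the question's first table). `dput` prints R code that recreates the object, and evaluating that code gives back an equivalent data frame; the names `code` and `X2` are hypothetical.

```r
# A small example input ('X' holds the first two rows of the question's table).
X <- data.frame(Brand = c("ABC", "DEF"),
                WORD  = c("cell", "dock"),
                Count = c(1, 2))

# dput() prints R code that rebuilds X; this text is what goes in the question.
dput(X)

# Round trip: capture the printed code and evaluate it to rebuild the object.
code <- paste(capture.output(dput(X)), collapse = "\n")
X2 <- eval(parse(text = code))

# Compare the rebuilt data frame with the original.
print(all.equal(X, X2))
```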