Answers: Replacing exact string with a word from a dataframe & matching strings just containing a certain word

【Question title】: Replacing exact string with a word from a dataframe & matching strings just containing a certain word
【Posted】: 2020-07-29 12:14:08
【Question】:

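I am using R and have two data frames. One, my_data, is my main dataset and contains order data; the other, word_list, contains a list of codes that I want to match against my_data.

Here is a reproducible example of both data frames: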

my_data <- data.frame(
  Order = c("1","2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"))

word_list <- data.frame(
  Codes = c("TS678","AB123", "GF123", "CC756"),
  Product_Category = c("Apple", "Apple", "Orange", "Orange"))

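What I want to do is match Product_ID in my_data against the codes in word_list and add a new column to my_data holding the matching Product_Category from word_list.
However, I need exact matches, and I also need to account for combinations of codes (see "PACK" in the sample data, where a single entry contains several product codes).

For the final data frame, I would like the following:

Exact match -> add the corresponding Product_Category, e.g. "Apple".
Match entries that contain a code from word_list alongside other codes. Some products are packs whose ID is mixed with other IDs -> if an entry contains an "Apple" code as well as other codes, the result should be "Apple + Other". A further complication is that the code to be matched carries a count prefix (e.g. PACK12 includes 1xGF123, 1xML680, and so on).
All entries with neither an exact match nor a mixed match should be assigned "Other".

To make this clearer, the final result I am after is a data frame like this: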

 my_data_result <- data.frame(
  Order = c("1","2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"),
  Product_Category = c("Apple", "Apple", "Orange + Other", "Apple", "Other", "Orange"))

I think this can be done with regex and gsub, but I am not sure how.

Thanks!

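【Comments】:

Just merge()...
Hi @Sotos! I am not sure, but as far as I understand, merge would not handle the mixed matches, i.e. the entries where I have several product IDs, such as "PACK12, 1xGF123, 1xML680" in the example above.
Sorry. I didn't catch that.
@Sotos, maybe one more thing to add: my matching list does not contain all product codes either, only the codes for "Apple" and "Orange". All other product codes are irrelevant, but I need to account for the fact that they sometimes appear in the same entry as Apple and Orange codes.

Tags: r dataframe text gsub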

【Solution 1】:

Since your data is large, you could try this data.table approach:

library(data.table)
library(splitstackshape)

#Convert to data.table
setDT(my_data)
setDT(word_list)

#Get the data in long format
df1 <- cSplit(my_data, 'Product_ID', direction = 'long')
#Remove initial characters 
df1[, Product_ID := sub('.*x', '', Product_ID)]

#Join the dataframes
df1 <- merge(df1, word_list, by.x = 'Product_ID', by.y = 'Codes', all.x = TRUE)
#Replace NA with "Other"
df1[, Product_Category := replace(Product_Category, 
                           is.na(Product_Category), 'Other')]

#Combine the values by Order
df1[, .(Product_ID = toString(Product_ID), 
       Product_Category = paste(sort(unique(Product_Category)), 
                          collapse = " + ")), Order]

#   Order            Product_ID Product_Category
#1:     5 2304TR, ML680, PACK13            Other
#2:     2                 AB123            Apple
#3:     4                 AB123            Apple
#4:     3  GF123, ML680, PACK12   Orange + Other
#5:     6                 GF123           Orange
#6:     1                 TS678            Apple
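
One caveat about the pattern '.*x' used above: it is greedy, so it removes everything up to the last lowercase x in the string, and a code that itself contains an x would be cut short. Below is a minimal sketch of a stricter alternative. It assumes the count prefix always has the form "digits followed by x", which is inferred from the sample data only; strip_count is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: remove only a leading "<digits>x" count prefix.
# trimws() also drops stray spaces left over from splitting on commas.
strip_count <- function(x) sub("^[0-9]+x", "", trimws(x))

codes <- c("1xGF123", "1x2304TR", "PACK12", " 2xAB123", "1xMx680")

strip_count(codes)
# "GF123"  "2304TR" "PACK12" "AB123"  "Mx680"

# The greedy pattern, for comparison: the last code loses its "Mx" part.
sub(".*x", "", codes)
# "GF123"  "2304TR" "PACK12" "AB123"  "680"
```

If product codes never contain a lowercase x, both patterns give the same result on the example data.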

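【Comments】:

Hey @Ronak! Thank you very much for your answer! I tried it and the code runs quickly. However, I noticed that it gives me different categories: in the final result I get "Orange + Other" as well as "Other + Orange". Is there a way to avoid this? Thanks!
@emil_rore Yes, you can sort it so the order is always the same. I have updated the answer.
I also noticed that it produces combinations such as "Apple + Orange + Other", "Other + Orange + Apple" and "Orange + Apple + Other", although they are all the same category. I have mapped them manually. Do you think there is a way to avoid this, though? Thanks again!
Hmm... that is strange. It should not happen, since we are using sort here. It should always give the same order. Is there any way to reproduce it?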
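
The cause of the inconsistent ordering was not identified in this thread. As a defensive sketch, the categories can be collapsed in an explicit, fixed order instead of relying on sort. The helper name combine_categories and its levels argument are hypothetical; the sketch assumes the full set of categories is known in advance and treats anything else, including NA, as "Other".

```r
# Hypothetical helper: collapse a group's categories in a fixed order.
combine_categories <- function(x, levels = c("Apple", "Orange", "Other")) {
  x <- trimws(as.character(x))               # normalise factors and stray spaces
  x[is.na(x) | !(x %in% levels)] <- "Other"  # unknown or missing -> "Other"
  paste(levels[levels %in% x], collapse = " + ")
}

combine_categories(c("Other", "Orange", "Other"))
# "Orange + Other"
combine_categories(c("Other ", "Apple", "Orange"))
# "Apple + Orange + Other"
combine_categories(c(NA, "Apple"))
# "Apple + Other"
```

In the data.table code above, this could stand in for the paste(sort(unique(...))) expression, e.g. Product_Category = combine_categories(Product_Category) inside the grouped call.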

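【Solution 2】:

Here is an idea using dplyr and tidyr. We split the rows into long format, clean the codes, match them against word_list, and collapse back to one string per Order, i.e.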

library(dplyr)
library(tidyr)

my_data %>% 
 separate_rows(Product_ID, sep = ', ') %>% 
 mutate(Product_ID = sub('.*x', '', Product_ID), 
        Product_Category = as.character(word_list$Product_Category[match(Product_ID, word_list$Codes)]), 
        Product_Category = replace(Product_Category, is.na(Product_Category), 'Other')) %>%
 group_by(Order) %>% 
 summarise_all(list(~toString(unique(.))))

# A tibble: 6 x 3
#  Order Product_ID            Product_Category
#  <fct> <chr>                 <chr>           
#1 1     TS678                 Apple           
#2 2     AB123                 Apple           
#3 3     PACK12, GF123, ML680  Other, Orange   
#4 4     AB123                 Apple           
#5 5     PACK13, ML680, 2304TR Other           
#6 6     GF123                 Orange

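【Comments】:

Hi @Sotos! Thank you very much for your answer! I am trying this code, but it has been running for an hour and I am not sure what the problem is. My dataset is very large (900k observations), so that may be the issue. Is there a solution that would let me process the data faster? It would also be fine to do this in several steps and/or to overwrite "Product_ID" in the final table. Thanks!
900K is not a lot. This should not be a problem. Try restarting your session.
Hey @Sotos, after running for more than an hour yesterday, it did not seem to combine the Product_Categories. So I restarted the R session and exported a file in which the rows are already separated, so that the first "separate_rows" call is no longer needed (to save time). I do not know why, but for some reason the code still does not seem to progress and takes a very long time.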
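
The slowdown was not resolved in this thread. One thing that could be tried is a base R version that does the splitting, cleaning and lookup on plain vectors and only groups at the very end. This is a sketch on the example data, not a benchmarked solution; the "^[0-9]+x" pattern assumes count prefixes are digits followed by x, and row_id, codes, category and combined are illustrative names.

```r
my_data <- data.frame(
  Order = c("1", "2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123",
                 "PACK13, 1xML680, 1x2304TR", "GF123"),
  stringsAsFactors = FALSE)

word_list <- data.frame(
  Codes = c("TS678", "AB123", "GF123", "CC756"),
  Product_Category = c("Apple", "Apple", "Orange", "Orange"),
  stringsAsFactors = FALSE)

# Split every Product_ID into its codes and remember the source row of each.
parts  <- strsplit(my_data$Product_ID, ",", fixed = TRUE)
row_id <- rep(seq_along(parts), lengths(parts))

# Trim spaces, drop a leading "<digits>x" count, then look up the category.
codes    <- sub("^[0-9]+x", "", trimws(unlist(parts)))
category <- word_list$Product_Category[match(codes, word_list$Codes)]
category[is.na(category)] <- "Other"   # codes missing from word_list

# Collapse the sorted unique categories for each original row.
combined <- tapply(category, row_id,
                   function(x) paste(sort(unique(x)), collapse = " + "))
my_data$Product_Category <-
  as.vector(combined[as.character(seq_len(nrow(my_data)))])

my_data$Product_Category
# "Apple" "Apple" "Orange + Other" "Apple" "Other" "Orange"
```

Because the original Product_ID column is left untouched, the result lines up with the my_data_result frame from the question. Timing it on a sample of the real data first would show whether the grouping step is still the bottleneck.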