Multiple String Matching in R: Answers

【Question title】: Multiple String Matching in R
【Posted】: 2016-04-14 09:58:58
【Question】:

Treat A, B, C, D, ... as words. I have two data frames.

df1:

ColA
A B
B C
C D
E F
G H
A M
M

df2:

ColB
A B C D X Y Z
C D M N F K L
S H A F R M T U

Operation: I want to search df2 for every element of df1, then either append all matching values in a new column or create multiple rows.

Output 1:

ColB                    COlB
A B C D X Y Z           A,A B,B C,C D
C D M N F K L           C D,M
S H A F R M T U         A,A M

Output 2:

ColB                   Output
A B C D X Y Z           A
A B C D X Y Z           A B
A B C D X Y Z           B C
A B C D X Y Z           C D
C D M N F K L           C D
C D M N F K L           M
S H A F R M T U         A
S H A F R M T U         A M

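【Comments on the question】:

Where does the single "A" come from? All the others are elements of df1$ColA. Are you sure this is not a mistake?
Very similar to stackoverflow.com/questions/36694470/….

Tags: r string text text-analysis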

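【Solution 1】:

I think this does it, although the result differs somewhat from your expected answer, which I believe is in error.

First, set up the input data frames: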

# set up the data
df1 <- data.frame(ColA = c("A B", 
                           "B C", 
                           "C D",
                           "E F",
                           "G H",
                           "A M",
                           "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z",
                           "C D M N F K L",
                           "S H A F R M T"),
                  stringsAsFactors = FALSE)

Next, we form all pairwise combinations of the strings to search in (ColB) and the patterns to search for (ColA):

# create a vector of patterns and items to search
intermediate <- as.vector(outer(df2$ColB, df1$ColA, paste, sep = "|"))
# split it into a list
intermediate <- strsplit(intermediate, "|", fixed = TRUE)
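
The paste-and-split round trip above relies on no string containing "|". As a hedged alternative sketch (not part of the original answer), `expand.grid` can build the same pairs directly as a data frame. The name `pairs_df` is introduced here purely for illustration, and the inputs are assumed to be the ones defined above.

```r
# Assumed inputs, copied from the setup above
df1 <- data.frame(ColA = c("A B", "B C", "C D", "E F", "G H", "A M", "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z", "C D M N F K L", "S H A F R M T"),
                  stringsAsFactors = FALSE)

# expand.grid varies its first argument fastest, which matches the
# column-major order of as.vector(outer(df2$ColB, df1$ColA, ...))
pairs_df <- expand.grid(ColB = df2$ColB, Output = df1$ColA,
                        stringsAsFactors = FALSE)

nrow(pairs_df)    # 3 * 7 = 21 combinations
head(pairs_df, 3) # the first pattern paired with each ColB string
```

Because the pairs never pass through a delimiter, this variant should also work when the strings themselves contain "|".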

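Then we can write a function that matches the elements in each row of this full set of combinations. The core is foundMatch, a logical vector indicating, for each row, whether every word of the ColA entry is present in the ColB entry. In your example the order of the words does not matter, so we split both strings into words and check that all words of the second string (the ColA entry) occur among the words of the first (the ColB entry).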

# set up the output data.frame
Output2 <- data.frame(do.call(rbind, intermediate))
names(Output2) <- c("ColB", "Output")

# here is the core, which does the element matching
foundMatch <- apply(Output2, 1, function(x) {
    tokens <- strsplit(x, " ", fixed = TRUE)
    all(tokens[[2]] %in% tokens[[1]])
})
# filter out the ones with the match
Output2 <- Output2[foundMatch, ]

Output2
##             ColB Output
## 1  A B C D X Y Z    A B
## 4  A B C D X Y Z    B C
## 7  A B C D X Y Z    C D
## 8  C D M N F K L    C D
## 18 S H A F R M T    A M
## 20 C D M N F K L      M
## 21 S H A F R M T      M

Not exactly what you stated above, but I think it is correct.
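
The answer above produces the long format (Output 2). For the one-row-per-string format (Output 1), a minimal sketch under the same word-matching assumption might look like the following. `has_all_words` and the `Matches` column are hypothetical names chosen for this example, not part of the original answer.

```r
# Assumed inputs, copied from the setup above
df1 <- data.frame(ColA = c("A B", "B C", "C D", "E F", "G H", "A M", "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z", "C D M N F K L", "S H A F R M T"),
                  stringsAsFactors = FALSE)

# TRUE when every word of `pattern` occurs among the words of `text`
# (hypothetical helper; word order is ignored, as in foundMatch)
has_all_words <- function(pattern, text) {
  all(strsplit(pattern, " ", fixed = TRUE)[[1]] %in%
        strsplit(text, " ", fixed = TRUE)[[1]])
}

# For each ColB string, keep the matching ColA entries and join them with commas
df2$Matches <- vapply(df2$ColB, function(text) {
  hits <- df1$ColA[vapply(df1$ColA, has_all_words, logical(1), text = text)]
  paste(hits, collapse = ",")
}, character(1), USE.NAMES = FALSE)

df2
```

On the example data this should give "A B,B C,C D", "C D,M" and "A M,M". The single "A" from the question's expected output does not appear, because "A" on its own is not an element of df1$ColA.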

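【Comments】:

【Solution 2】:

It is not obvious to me how your data frames df1 and df2 are constructed, but you could try flattening the data into vectors and matching the two sets.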

d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
# get the intersection/difference without duplicates
intersect(d1,d2)
setdiff(d1,d2)
# get all values matching with the first or with the second dataset, respectively 
d1[ d1 %in% d2 ]
d2[ d2 %in% d1 ]
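
As a rough check, assuming df1 and df2 as defined in Solution 1, the whole-string comparison above finds no overlap, because no ColA entry equals a complete ColB entry. The sketch below compares at the word level instead; `w1` and `w2` are illustrative names. Note that it only reports which words are shared, not which ColA entry matches which ColB row.

```r
# Assumed inputs, copied from Solution 1
df1 <- data.frame(ColA = c("A B", "B C", "C D", "E F", "G H", "A M", "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z", "C D M N F K L", "S H A F R M T"),
                  stringsAsFactors = FALSE)

d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
intersect(d1, d2)   # empty: no whole string appears in both data frames

# Split every entry into its words before comparing
w1 <- unique(unlist(strsplit(df1$ColA, " ", fixed = TRUE)))
w2 <- unique(unlist(strsplit(df2$ColB, " ", fixed = TRUE)))
intersect(w1, w2)   # words that occur in both data frames
setdiff(w1, w2)     # words of df1 that never occur in df2
```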

【Comments】:

This does not produce the desired output.