Match a sentence with a sentence in R? Answers

【Question title】: Match a sentence with a sentence in R?
【Posted】: 2017-04-07 19:04:55
【Question】:

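I have two data frames, occupation and data. I want to match each occupation in data against the sentences in occupation and assign the corresponding class by adding a column to the occupation data frame.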

occupation <- c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank", "Love my profession of Professor", "NA")

occupation <- data.frame(occupation)

data <- data.frame(class = c("Engineers","Designer","Artist","Designer","Poetry","Banker and Prof"), Occupation = c("Civil Engineer", "Graphic Designer", "Painter","Poetry","Architect(prof)", "Sales Manager Bank"))

I want output like this:

    occupation                             class
    I am Civil Engineer human being        Engineers
    Painter  Architect Poetry              Artists
    Graphic Designer too late              Designers
    Architect by Painter profession        Architect
    Sales Manager Bank                     Banker and Prof
    Love my profession of Professor        NA
    NA                                     NA

I tried this, but it returns nothing useful:

occupation$value <- sapply(data$occupation, grepl, x = occupation)

【Comments】:

Try searching for "r fuzzy matching" until you find something you like.

Tags: r dataframe sapply grepl

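【Solution 1】:

I don't know how complex your data is, but this works well for low-complexity strings. The agrep function lets you set a tolerance parameter (max.distance), so you can match strings that are not exactly equal: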

occupation <- data.frame(occupation = c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank"), 
                         stringsAsFactors = FALSE)
data <- data.frame(class = c("Engineers","Designer","Architect","Banker and Prof"), 
                   occupation = c("Civil Engineer", "Graphic Designer", "Architect(prof)", "Sales Manager Bank"),
                   stringsAsFactors = FALSE)

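occupation$value <- sapply(occupation$occupation, function(x) {
    # agrepl returns TRUE/FALSE, so match.class is a plain logical vector
    match.class <- sapply(data$class, function(y) agrepl(y, x, max.distance = 0.2))
    # classes whose name approximately occurs in sentence x (empty if none)
    data$class[which(match.class)]
  }
)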

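If you raise max.distance you can also detect the last sentence, but the earlier strings will then match more loosely as well. The outputs shown in this answer list data$occupation values; they appear to predate the change to data$class mentioned in the discussion.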

                       occupation            value
1 I am Civil Engineer human being   Civil Engineer
2       Graphic Designer too late Graphic Designer
3         Architect by profession  Architect(prof)
4              Sales Manager Bank
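
To make the role of max.distance more concrete, here is a minimal sketch. It assumes the documented behaviour of base R's agrepl: a value below 1 is treated as a fraction of the pattern length and rounded up to a whole number of allowed edits. The variable names are only illustrative.

```r
sentence <- "I am Civil Engineer human being"

# max.distance = 0: the pattern must occur exactly, and "Engineers" does not
exact <- agrepl("Engineers", sentence, max.distance = 0)

# max.distance = 0.2 on a 9-character pattern allows ceiling(0.2 * 9) = 2 edits,
# so "Engineers" can match the "Engineer" inside the sentence
fuzzy <- agrepl("Engineers", sentence, max.distance = 0.2)

# a one-letter pattern still gets ceiling(0.2 * 1) = 1 edit, so it matches
# almost any text
short <- agrepl("I", "Designer", max.distance = 0.2)

print(c(exact = exact, fuzzy = fuzzy, short = short))
```

The last call is why very short patterns are risky with a fractional tolerance: one allowed edit is enough to turn a single letter into any other character.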

The second option matches word by word, but for a sentence like "I am Civil Engineer" the words "I" and "am" match every entry.

occupation$value <- sapply(occupation$occupation, function(x) {
    # split the sentence into words and test each word against every class name
    match.class <- sapply(data$class, function(y) {
      any(sapply(strsplit(x, ' ')[[1]], function(z)
        agrepl(z, y, max.distance = 0.2)))
    })
    data$class[which(match.class)]
  }
)

This is the result:

                       occupation                                                                 value
1 I am Civil Engineer human being Civil Engineer, Graphic Designer, Architect(prof), Sales Manager Bank
2       Graphic Designer too late                                                      Graphic Designer
3         Architect by profession                                                       Architect(prof)
4              Sales Manager Bank                                                    Sales Manager Bank
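
One possible way to soften the "I" / "am" problem is to skip very short words before matching. This is only a sketch and not part of the original answer: the min.chars threshold is a hypothetical choice, and the sentences and classes vectors simply repeat the sample data from this answer.

```r
# Hypothetical threshold: ignore words shorter than this many characters.
min.chars <- 4

sentences <- c("I am Civil Engineer human being", "Graphic Designer too late",
               "Architect by profession", "Sales Manager Bank")
classes <- c("Engineers", "Designer", "Architect", "Banker and Prof")

matched <- lapply(sentences, function(x) {
  words <- strsplit(x, " ")[[1]]
  # drop short words such as "I", "am", "by", "too"
  words <- words[nchar(words) >= min.chars]
  # TRUE for each class name that approximately contains one of the words
  hit <- vapply(classes, function(y) {
    any(vapply(words, function(z) agrepl(z, y, max.distance = 0.2), logical(1)))
  }, logical(1))
  classes[hit]
})

print(matched[[1]])
```

With this sample data the first sentence should no longer match every class. The threshold is a blunt tool, though: it would also discard genuinely short occupation words, so it needs tuning against real data.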

Here is the link where you can download the code.

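【Discussion】:

That is not what I was asking; I think there is a misunderstanding. I want to get the class for each row of the occupation data frame, not the occupation.
Yes, sorry. I changed data$occupation[which(match.class)] to data$class[which(match.class)] on line 10 of the first code block and line 6 of the second.
Thanks. If the occupation field contains more than one occupation, as in my edit above, and I only want the class of the first occupation, what would the code be?
Just replace data$class[which(match.class)] with data$class[which(match.class)[1]] in the same line mentioned above.
The code above does not work for me: option 1 gives integer(0) as output, and option 2 gives the error "Error in strsplit(x, " "): non-character argument".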
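
A likely cause of the "non-character argument" error is that the column is a factor: before R 4.0.0, data.frame() converted strings to factors by default, and the data frame in the question is built without stringsAsFactors = FALSE. The sketch below reproduces the error on a factor and shows that converting with as.character avoids it; the variable names are illustrative.

```r
# A factor, as produced by data.frame() with stringsAsFactors = TRUE
f <- factor("I am Civil Engineer human being")

# strsplit() refuses anything that is not a character vector;
# capture the error message instead of stopping
err <- tryCatch(strsplit(f, " "), error = function(e) conditionMessage(e))
print(err)

# Converting to character first makes the split work
words <- strsplit(as.character(f), " ")[[1]]
print(words)
```

Applying occupation$occupation <- as.character(occupation$occupation) before running either option should have the same effect on the question's data frame.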

【Solution 2】:

agrep comes very close. I could not get it to work for Architect(prof), but if you remove the parentheses it works:

data$Occupation <- sub("\\(.*", "", data$Occupation)
data
            class         Occupation
1       Engineers     Civil Engineer
2        Designer   Graphic Designer
3        Designer          Architect
4 Banker and Prof Sales Manager Bank

occ.class <- data$class[unlist(sapply(data$Occupation, function(x) agrep(x, occupation)))]
occ.class
[1] Engineers       Designer        Designer        Banker and Prof
Levels: Banker and Prof Designer Engineers

If you want the third one to show Architect, change it accordingly in the data data.frame.

As for the edit:

occ.class <- unlist(sapply(data$Occupation, function(x) agrep(x, occupation)))
ifelse(length(occ.class), data$class[occ.class], NA)
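
Note that ifelse() returns a result shaped like its test, so with a length-one test it yields a single value, and indexing a factor inside it gives the integer code rather than the label. As a hedged alternative, the sketch below assigns one class per sentence and NA when nothing matches. The helper name first_class is hypothetical, the columns are assumed to be character, and "first" means the first matching row of data, not the first occupation in the sentence.

```r
occupation <- c("I am Civil Engineer human being", "Graphic Designer too late",
                "Architect by profession", "Sales Manager Bank",
                "Love my profession of Professor", NA)
data <- data.frame(class = c("Engineers", "Designer", "Designer", "Banker and Prof"),
                   Occupation = c("Civil Engineer", "Graphic Designer",
                                  "Architect", "Sales Manager Bank"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: class of the first row of `data` whose occupation
# approximately occurs in the sentence, or NA if there is none.
first_class <- function(sentence, patterns, classes, max.distance = 0.1) {
  if (is.na(sentence)) return(NA_character_)
  hit <- vapply(patterns,
                function(p) agrepl(p, sentence, max.distance = max.distance),
                logical(1))
  if (any(hit)) classes[which(hit)[1]] else NA_character_
}

occ.class <- vapply(occupation, first_class, character(1),
                    patterns = data$Occupation, classes = data$class,
                    USE.NAMES = FALSE)
print(occ.class)
```

Because the result has one entry per sentence, it can be attached directly as a new column, for example with occupation.df$class <- occ.class on a data frame that holds the same sentences.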

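【Discussion】:

I get output like this: > occ.class [1] "Engineers" "Engineers" "Engineers" "Engineers"
What about the case where there is no match? I have edited my question with an example; please take a look.
> occ.class gives: Civil Engineer Graphic Designer Architect Sales Manager Bank 1 2 3 4, and > ifelse(length(occ.class), data$class[occ.class], NA) gives [1] 3. That is the output I get. Is something wrong?
If a sentence contains two or more occupations, only the class of the first occupation should be assigned. What would the code for that be? Please see the edited question. Thanks for your help.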