【Question title】: How to perform approximate (fuzzy) name matching in R
【Posted】: 2014-05-18 14:59:19
【Question】:

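I have a large dataset about biological journals that was compiled by different people over a long period of time, so the data are not in a single format. For example, in the "Author" column I can find John Smith, Smith John, Smith J and so on, although they are all the same person. I cannot perform even the simplest operations. For example, I cannot work out which authors wrote the most articles.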

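Is there any way in R to determine whether most of the characters in different names are the same, and then treat those names as the same element?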

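【Comments】:

  • Take a look at OpenRefine (formerly Google Refine). This sounds like a data-preparation job that it suits better than R. It is very simple to install and has powerful features, and there are scads of tutorials and examples, some of which deal with your names problem.
  • You might want to try agrep (approximate string matching): stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html
  • I would search for "record linkage". I haven't done this in a while, but there is a RecordLinkage package that might help. Also, I remember some suggestions/links in this previous question.
  • Regarding the R package RecordLinkage: Package 'RecordLinkage' was removed from the CRAN repository. Formerly available versions can be obtained from the archive. Archived on 2014-05-31 as memory access errors were not corrected.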
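
The agrep suggestion above can be sketched with base R alone. This is a minimal illustration, not a full solution: the `authors` vector and the helper `sort_tokens` are hypothetical names introduced here, and the distance threshold is an assumption you would need to tune. Note that edit distance by itself is sensitive to word order, so the sketch also shows one possible normalisation (sorting the words of each name) before comparing.

```r
# Hypothetical sample data, loosely based on the names in the question
authors <- c("John Smith", "Smith John", "Smith J", "Jon Smith", "Erik Conor")

# agrep() returns the elements that approximately contain the pattern.
# max.distance = 0.1 is the documented default: a fraction of the pattern length.
close_to_john <- agrep("John Smith", authors, max.distance = 0.1, value = TRUE)

# adist() returns a matrix of generalized Levenshtein (edit) distances
d <- adist("John Smith", authors)

# Assumed normalisation: lower-case each name and sort its words, so that
# "Smith John" and "John Smith" become the same string before comparison
sort_tokens <- function(x) {
  vapply(strsplit(tolower(x), " +"),
         function(parts) paste(sort(parts), collapse = " "),
         character(1))
}

d_sorted <- adist(sort_tokens("John Smith"), sort_tokens(authors))
```

Comparing `d` with `d_sorted` shows how much of the distance comes from word order alone. Abbreviations such as "Smith J" still need separate handling.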

Tags: r analytics openrefine


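【Solution 1】:

There are packages that can help with this, and some of them are listed in the comments. If you would rather not use those, I have tried to write something in R that might help. The code will match "John Smith" with "J Smith", "John Smith", "Smith John" and "John S". At the same time, it will not match something like "John Sally".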

# generate some random names
names = c(
  "John Smith", 
  "Wigberht Ernust",
  "Samir Henning",
  "Everette Arron",
  "Erik Conor",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);

# split those names and get all ways to write that name
split_names = lapply(
  X = names,
  FUN = function(x){
    print(x);
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # get both combinations of c_split to compensate for order
    c_splits = list(c_split, rev(x = c_split));
    # return c_splits
    c_splits;
  }
)

# suppose we're looking for John Smith
search_for = "John Smith";

# split it by " " and then find all ways to write that name
search_for_split = unlist(x = strsplit(x = search_for, split = " "));
search_for_split = list(search_for_split, rev(x = search_for_split));

# initialise a vector containing if search_for was matched in names
match_statuses = c();

# for each name that's been split
for(i in 1:length(x = names)){

  # the match status for the current name
  match_status = FALSE;

  # the current split name
  c_split_name = split_names[[i]];

  # for each element in search_for_split
  for(j in 1:length(x = search_for_split)){

    # the current combination of name
    c_search_for_split_names = search_for_split[[j]];

    # for each element in c_split_name
    for(k in 1:length(x = c_split_name)){

      # the current combination of current split name
      c_c_split_name = c_split_name[[k]];

      # if there's a match, or the length of grep (a pattern finding function is
      # greater than zero)
      if(
        # is c_search_for_split_names first element in c_c_split_name first
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[1],
            x = c_c_split_name[1]
          )
        ) > 0 &&
        # is c_search_for_split_names second element in c_c_split_name second 
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[2],
            x = c_c_split_name[2]
          )
        ) > 0 ||
        # or, is c_c_split_name first element in c_search_for_split_names first 
        # element
        length(
          x = grep(
            pattern = c_c_split_name[1],
            x = c_search_for_split_names[1]
          )
        ) > 0 &&
        # is c_c_split_name second element in c_search_for_split_names second 
        # element
        length(
          x = grep(
            pattern = c_c_split_name[2],
            x = c_search_for_split_names[2]
          )
        ) > 0
      ){
        # if this is the case, update match status to TRUE
        match_status = TRUE;
      } else {
        # otherwise, don't update match status
      }
    }
  }

  # append match_status to the match_statuses list
  match_statuses = c(match_statuses, match_status);
}

search_for;

[1] "John Smith"

cbind(names, match_statuses);

     names             match_statuses
[1,] "John Smith"      "TRUE"        
[2,] "Wigberht Ernust" "FALSE"       
[3,] "Samir Henning"   "FALSE"       
[4,] "Everette Arron"  "FALSE"       
[5,] "Erik Conor"      "FALSE"       
[6,] "Smith J"         "TRUE"        
[7,] "Smith John"      "TRUE"        
[8,] "John S"          "TRUE"
[9,] "John Sally"      "FALSE"   

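Hopefully this code can serve as a starting point. You may want to adapt it to work with names of arbitrary length.

Some notes:

  • for loops in R can be slow. If you are working with a lot of names, look into Rcpp.

  • You may want to wrap this in a function. You can then apply it to different names by changing search_for.

  • This example has time-complexity issues, and depending on the size of your data you may want or need to rework it.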
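
The second note (wrapping the logic in a function) could look something like the sketch below. It is not the answer's own code: the function names `tokens_agree`, `name_matches` and `match_names` are hypothetical, and it deliberately swaps grep's substring test for a case-insensitive prefix test via `startsWith()`, on the assumption that abbreviations are initials or leading fragments. Like the loop above, it only handles two-word names.

```r
# Does one token abbreviate the other? (prefix test, case-insensitive)
tokens_agree <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  startsWith(a, b) | startsWith(b, a)
}

# TRUE if a two-word candidate matches a two-word target in either word order
name_matches <- function(candidate, target) {
  cand <- strsplit(candidate, " ", fixed = TRUE)[[1]]
  targ <- strsplit(target, " ", fixed = TRUE)[[1]]
  if (length(cand) != 2 || length(targ) != 2) return(FALSE)
  same_order <- tokens_agree(cand[1], targ[1]) && tokens_agree(cand[2], targ[2])
  swapped    <- tokens_agree(cand[1], targ[2]) && tokens_agree(cand[2], targ[1])
  same_order || swapped
}

# Apply the test to a whole vector of names
match_names <- function(author_names, search_for) {
  vapply(author_names, name_matches, logical(1),
         target = search_for, USE.NAMES = FALSE)
}

author_names <- c("John Smith", "Wigberht Ernust", "Samir Henning",
                  "Everette Arron", "Erik Conor", "Smith J",
                  "Smith John", "John S", "John Sally")

statuses <- match_names(author_names, "John Smith")
```

With the sample names from this answer, `statuses` agrees with the match_statuses column printed above. Searching for a different person only requires changing the second argument.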

【Comments】:

  • Edit: search_for_split = unlist(x = strsplit(x = search_for, split = " "));
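【Solution 2】:

This extends @joshua-daly's excellent answer to achieve two useful goals.

(1) Find permutations of names with n>2 words (e.g., Robert Allen Zimmerman aka Bob Dylan).

(2) Run a search defined by fewer than all of the names on record (e.g., Bob Dylan).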

library(gtools)
x <- c("Yoda","speaks","thus")
permutations(n=3, r=3, v=x, repeats.allowed = FALSE) # n=num.elems r=num.times v=x

# generate some random names
names <- c(
  "John Smith", 
  "Robert Allen Zimmerman (Bob Dylan)",
  "Everette Camille Arron",
  "Valentina Riquelme Molina",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);

# drop parentheses, if any
names <- gsub("[(|)]", "", names)


# split those names and get all ways to write that name into a list of same length
split_names <- lapply(
  X = gsub("[(|)]", "", names),
  FUN = function(x){
    print(x);
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # get all permutations of c_split to compensate for order
    n <- r <- length(c_split)
    c_splits <- list(permutations(n=n, r=r, v=c_split, repeats.allowed = FALSE))
    # return c_splits
    c_splits;
  }
)

split_names

# suppose we're looking for this name
search_for <- "Bob Dylan";

# split it by " " and then find all ways to write that name
search_for_split <- unlist(x = strsplit(x = search_for, split = " "));
# permutations over search_for_split seem redundant

# initialize a vector containing if search_for was matched in names
match_statuses <- c();

# for each name that's been split
for(i in 1:length(names)){

    # the match status for the current name
    match_status <- FALSE;

    # the current split name
    c_split_name <- as.data.frame(split_names[[i]]);

    # for each element in c_split_name
    for(j in 1:nrow(c_split_name)){

        # the current permutation of current split name
        c_c_split_name <- as.matrix(c_split_name[j,]);

        # will receive hits in name's words, one by one, in sequence
        hits <- rep(0, 20) # length 20 should always be above max number of words in names

        # for each element in search_for_split
        for(k in 1:length(search_for_split)){

            # the current permutation of name
            c_search_for_split <- search_for_split[[k]];

            # L first hits will receive hit counts
            L <- min(ncol(c_c_split_name), length(search_for_split));

            # will match as many words as the shortest current pair of names  
            for(l in 1:L){

                # if there's a match, the length of grep is greater than zero
                if(
                    # is c_search_for_split in c_c_split_name's lth element
                    length(
                        grep(
                            pattern = c_search_for_split,
                            x = as.character(c_c_split_name[l])
                        )
                    ) > 0 ||
                    # or, is c_c_split_name's lth element in c_search_for_split
                    length(
                        grep(
                            pattern = c_c_split_name[l],
                            x = c_search_for_split
                        )
                    ) > 0

                # if this is the case, record a hit    
                ){
                    hits[l] <- 1;
                } else {
                # otherwise, don't update hit
                }
            }
        }

        # take L first elements
        hits <- hits[1:L]

       # if hits vector has all ones for this permutation, update match status to TRUE
       if(
           sum(hits)/length(hits)==1 # <- can/should be made more flexible (agrep, or sum/length<1)
       ){
           match_status <- TRUE;
       } else {
       # otherwise, don't update match status
       }
    }

    # append match_status to the match_statuses list
    match_statuses <- c(match_statuses, match_status);
}

search_for;

cbind(names, match_statuses);
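
The inline comment above notes that the all-or-nothing test could be made more flexible. One way to sketch that, without generating permutations at all, is to score each name by the fraction of search words that agree with some word in the name and then apply a threshold. This is an alternative under stated assumptions rather than the answer's method: `match_fraction` is a hypothetical helper, it uses a case-insensitive prefix test instead of grep, and it does not require each search word to pair with a distinct word in the name.

```r
# Fraction of the search words that agree with at least one word of `candidate`
match_fraction <- function(candidate, search_for) {
  # drop parentheses, lower-case, split on runs of spaces
  cand <- strsplit(tolower(gsub("[()]", "", candidate)), " +")[[1]]
  srch <- strsplit(tolower(search_for), " +")[[1]]
  cand <- cand[nzchar(cand)]  # ignore empty tokens from stray spaces
  srch <- srch[nzchar(srch)]
  hit <- vapply(
    srch,
    function(s) any(startsWith(cand, s) | startsWith(s, cand)),
    logical(1)
  )
  mean(hit)
}

author_names <- c("John Smith",
                  "Robert Allen Zimmerman (Bob Dylan)",
                  "Valentina Riquelme Molina",
                  "John Sally")

scores_dylan <- vapply(author_names, match_fraction, numeric(1),
                       search_for = "Bob Dylan", USE.NAMES = FALSE)

# A threshold of 1 reproduces strict matching; a lower value is more lenient
is_match <- scores_dylan >= 1
```

Because the score is a fraction, a name such as "John Sally" gets 0.5 against "John Smith", so a threshold below 1 would accept it. Whether that is desirable depends on the data.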
