【Question Title】: What is the best way to join tables when key(s) are not normalized in R?
【Posted】: 2014-04-21 15:19:39
【Question】:

Say I have two tables, name and age, like this:

> name
    key   name
1 a,b,c   jack
2     d daniel
3     e    foo
4   f,g    bar
> age
  key age
1   b  13
2   d  21
3   e  24
4   k  34
5   f 100

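I am trying to join these two tables on the key column that exists in both. The challenge is that the key column in the name table is not normalized: a single cell can hold several comma-separated keys. What is the best way to combine the two tables so that every row of the name table is present and unchanged in the joined table (like the "res" table)?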

> res
    key   name age
1 a,b,c   jack  13
2     d daniel  21
3     e    foo  24
4   f,g    bar 100

Here is the dput output needed to reproduce the tables:

> dput(name)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor")), .Names = c("key", 
"name"), class = "data.frame", row.names = c(NA, -4L))

> dput(age)

structure(list(key = structure(c(1L, 2L, 3L, 5L, 4L), .Label = c("b", 
"d", "e", "f", "k"), class = "factor"), age = c(13L, 21L, 24L, 
34L, 100L)), .Names = c("key", "age"), class = "data.frame", row.names = c(NA, 
-5L))

> dput(res)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor"), 
    age = c(13L, 21L, 24L, 100L)), .Names = c("key", "name", 
"age"), class = "data.frame", row.names = c(NA, -4L))

【Comments】:

    Tags: r join merge


    【Solution 1】:

    Perhaps you could coerce the "key" column of the "name" data.frame into regular expression patterns and use sapply, like this:

    sapply(gsub(",", "|", name$key), function(x) grep(x, age$key))
    # a|b|c     d     e   f|g 
    #     1     2     3     5 
    

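    The above returns the row numbers of the "age" data.frame where a match was found, in the order in which they were found.

    You can then use this information to extract the "age" values from the "age" data.frame with basic [row, col] indexing, as follows, and assign the result to name$age: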

    age[sapply(gsub(",", "|", name$key), function(x) grep(x, age$key)), "age"]
    # [1]  13  21  24 100
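
One caveat with the regex approach is that an unanchored pattern such as `a|b|c` also matches longer keys that merely contain one of those letters. The following is a minimal sketch, not part of the original answer, that anchors each pattern and falls back to NA when nothing matches. It assumes the keys contain no regex metacharacters and rebuilds the question's tables with character columns; `patterns` and `rows` are names introduced here for illustration.

```r
# Toy data mirroring the question (character columns for simplicity)
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Turn "a,b,c" into the anchored pattern "^(a|b|c)$" so that a key such as
# "b" cannot match a longer key like "ab"
patterns <- paste0("^(", gsub(",", "|", name$key, fixed = TRUE), ")$")

# For each pattern keep the first matching row of `age`, or NA if none matches
rows <- vapply(patterns, function(p) grep(p, age$key)[1], integer(1),
               USE.NAMES = FALSE)

# Store the looked-up ages alongside the untouched name table
name$age <- age$age[rows]
name
```

Using `vapply` with `integer(1)` keeps the result a plain integer vector even when a row has no match, whereas `sapply` would silently return a list in that case.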
    

    【Comments】:

      【Solution 2】:

      I don't mind using two joins:

      library(plyr)
      # factors to character vectors:
      name <- as.data.frame(sapply(name, as.character), stringsAsFactors=FALSE)
      
      # split comma-separated ids into named list:
      (tmp <- setNames(strsplit(name$key, ","), name$name))
      # $jack
      # [1] "a" "b" "c"
      # 
      # $daniel
      # [1] "d"
      # 
      # $foo
      # [1] "e"
      # 
      # $bar
      # [1] "f" "g"
      
      # list to long 2-column data frame:
      (tmp <- setNames(ldply(tmp, matrix), c("name", "key")) )
      #     name key
      # 1   jack   a
      # 2   jack   b
      # 3   jack   c
      # 4 daniel   d
      # 5    foo   e
      # 6    bar   f
      # 7    bar   g
      
      # join the long data frame with the age table (1st join) &
      # add the original comma-separated key column (2nd join)
      join(join(age, tmp, type="inner"),
           name, by="name")[-1] 
      #   age   name   key
      # 1  13   jack a,b,c
      # 2  21 daniel     d
      # 3  24    foo     e
      # 4 100    bar   f,g
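
The same split-then-join idea can be sketched in base R without plyr. This is an illustrative alternative rather than the answerer's code: it assumes character key columns, and `parts`, `long`, and `hit` are hypothetical names. Each individual key remembers the row of `name` it came from, so the original comma-separated key column never has to be joined back.

```r
# Toy data mirroring the question (character columns for simplicity)
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Split the comma-separated keys and build one row per individual key,
# remembering which row of `name` each key came from
parts <- strsplit(name$key, ",", fixed = TRUE)
long <- data.frame(row = rep(seq_along(parts), lengths(parts)),
                   key = unlist(parts),
                   stringsAsFactors = FALSE)

# Inner join on the individual key, then map ages back to the original rows
hit <- merge(long, age, by = "key")
res <- name
res$age <- hit$age[match(seq_len(nrow(name)), hit$row)]
res
```

Rows of `name` with no matching key end up with NA, and if several keys of one row match, `match` keeps only the first hit in the merged table.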
      

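      【Comments】:

        【Solution 3】:

        For each row, I would split the composite key with the stri_split_fixed function from the stringi package, and then try to match the resulting pieces against the keys in the second data set.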

        library(stringi)
        res <- name
        keys <- stri_split_fixed(name$key, ",") # returns a list of individual keys in each row
        res$age <- sapply(1:nrow(name), function(r) {
           keys <- keys[[r]] # get the keys in rth row
           age$age[which(age$key %in% keys)]
        })
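
The loop above returns a clean vector only when every row matches exactly one key; with zero or several matches, `sapply` would return a list. Below is a hedged sketch of one way to make that explicit. It swaps in base R `strsplit` for `stri_split_fixed` so that it runs without stringi, and it uses made-up data (an unmatched row `"x,y"` and a row matching two keys) that is not from the question.

```r
# Hypothetical data: row 1 matches two keys, row 3 matches none
name <- data.frame(key = c("a,b,c", "d", "x,y"),
                   name = c("jack", "daniel", "nobody"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "c", "d"),
                  age = c(13L, 15L, 21L),
                  stringsAsFactors = FALSE)

# One character vector of individual keys per row of `name`
keys <- strsplit(name$key, ",", fixed = TRUE)

# Always return exactly one integer per row: the first hit, or NA if no hit
name$age <- vapply(keys, function(k) {
  hits <- age$age[age$key %in% k]
  if (length(hits) == 0L) NA_integer_ else hits[1L]
}, integer(1))
name
```

Taking the first hit is only one possible policy; depending on the data, raising an error on ambiguous rows may be the safer choice.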
        

        This gives the result you asked for.

        If the keys contain (or may contain) whitespace, a regular expression split is more appropriate:

        stri_split_regex(name$key, ",\\p{Z}*")
        

        or even extracting the sequences of word characters:

        stri_extract_all_regex(name$key, "\\w+")
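
To see how the two options differ, here is a small base R sketch of the same two ideas, using `strsplit` and `regmatches` in place of the stringi functions (and `\\s` in place of `\\p{Z}`) so that it runs without extra packages. The sample keys are invented for illustration.

```r
# Invented keys with whitespace on either side of the commas
keys <- c("a, b,c", "d", "f ,  g")

# Split on a comma followed by optional whitespace;
# whitespace *before* a comma is left attached to the key
split1 <- strsplit(keys, ",\\s*")

# Extract runs of word characters, ignoring all surrounding whitespace
split2 <- regmatches(keys, gregexpr("\\w+", keys))

split1[[3]]
split2[[3]]
```

Splitting only removes whitespace after the comma, so the third key yields `"f "` with a trailing space, while extraction returns a clean `"f"`. Extraction assumes the keys themselves consist solely of word characters.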
        

        【Comments】:
