【Question title】: Conditional keyed join/update _and_ update a flag column for matches
【Posted】: 2015-09-28 13:31:37
【Question】:

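This is very similar to the question @DavidArenburg asked about conditional keyed joins, with one added wrinkle that I can't seem to figure out.

Basically, in addition to the conditional join, I want to define a flag recording at which step of the matching process the match occurred; my problem is that I can only get the flag defined for all values, not just the matched ones.

Here is what I hope is a minimal working example: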

library(data.table)

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  # flags start as FALSE, matching the printed table below
  flag1 = FALSE, flag2 = FALSE, key = "year"
)

DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE

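My approach is, for each year, to first try to match on first name/surname from prior years; if that fails, to try to match on first name/maiden name. I want flag1 to denote an exact match and flag2 to denote a marriage.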

for (yr in 2:4) {

  #which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]

  #find people in prior years appearing to
  #  correspond to those people
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)

  #merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N],by=id]
  setkey(unmatched, name, surname)

  #now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year==yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #this is messy, but I'm trying to increment id
  #  for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)], 
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
     ]

  #re-sort by year at the end    
  setkey(DT, year)    
}

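I had hoped that by including the TRUE value in the j argument when defining id, only the matched names (e.g., Joe in the first step) would have their flag set to TRUE, but that is not the case; they are all updated: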

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

Is there a way to update the flag value only for the matched rows? The ideal output is as follows:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

【Comments】:

    Tags: r data.table


    【Answer 1】:

    I think the flags are messy here; it is better to simply identify where each id came from:

    # assumes `dt` refers to the question's table, e.g. dt <- copy(DT)
    dt[,c("flag1","flag2"):=NULL]
    
    # create name -> id table
    namemap <- unique(dt[,.(maiden_name,id,year),keyby=.(name,surname)],by=NULL)
    
    # tag original ids
    namemap[!is.na(id),src:="original"]
    
    # carried over from earlier years
    namemap[, has_oid := any(!is.na(id)), by=key(namemap)]
    namemap[(has_oid),`:=`(
      id  = id[!is.na(id)],
      src = ifelse(is.na(id), "history", src)
    ),by=.(name,surname)]
    
    # carry over for surname changes on marriage
    namemap[maiden_name!="",`:=`(
      id  = namemap[.BY]$id,
      src = "maiden" 
    ),by=.(name,maiden_name)]
    
    # create new ids where none exists
    namemap[is.na(id),`:=`(
      id  = .GRP+max(dt$id,na.rm=TRUE),
      src = "new"
    ),by=.(name,surname)]
    
    # copy back to the original table
    setkey(dt,name,surname,year)
    setkey(namemap,name,surname,year)
    dt[namemap,`:=`(
      id  = i.id,
      src = src
    )]
    

    which gives

         name surname maiden_name id year      src
     1:   Ann  Cotter              4    2      new
     2:   Ann  Cotter              4    3      new
     3:  Beth   Brown              5    3      new
     4: Carol  Clymer              3    1 original
     5: Carol   Klein      Clymer  3    2   maiden
     6:   Jim   Jones              2    1 original
     7:   Joe   Smith              1    1 original
     8:   Joe   Smith              1    1 original
     9:   Joe   Smith              1    2  history
    10:   Joe   Smith              1    4  history
    11:   Joe   Smith              1    4  history
    

    The original order of the data is lost, but it is easy to restore if you want.
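One way to restore the order, sketched on toy data rather than the answer's tables: record the row position before re-keying, then sort on it afterwards. The `row_id` column is a hypothetical helper introduced only for this illustration.

```r
library(data.table)

# Hypothetical toy table, not the question's data
toy <- data.table(name = c("Joe", "Ann", "Carol"), year = c(1, 2, 1))

toy[, row_id := .I]       # remember each row's original position
setkey(toy, name, year)   # keying physically re-sorts the rows
setorder(toy, row_id)     # sort back to the recorded order
toy[, row_id := NULL]     # drop the helper column again
```

Adding `row_id` before the first `setkey()` call is the important part; once the table has been re-sorted, the original positions cannot be recovered from the data alone.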

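    【Comments】:

    • So basically, we merge the result of the merge I was doing back into the original table?
    • @MichaelChirico I've updated my answer. This is roughly what I would do. I don't think the year needs to be mentioned.
    • I'm afraid I oversimplified my working example, or else what I'm doing is more exact than necessary for what I'm actually trying to solve
    • Updated, and more complicated now; still simpler than what I'm actually doing, but I think it now has all the necessary nuances
    • Yes, the code has been cleaned up a lot since then, but you get the idea that it's fairly byzantine. String data gives me nightmares...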
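    【Answer 2】:

    I think the key (no pun intended) is to realize that the merge returns NA for the ids that were missed, so I should add the flag to unmatched at each step, e.g., in step 1: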

    unmatched <- dt[.(1:(yr - 1L))
                    ][!id %in% existing_ids,
                      .SD[.N], by = id][ , flag1 := TRUE]
    dt[year == yr, c("id", "flag1") := 
         unmatched[.SD, .(id, flag1), on = c("name", "surname")]]
    

    In the end, this produces:

    > dt[ ]
         name surname maiden_name id year flag1 flag2
     1: Carol  Clymer              3    1 FALSE FALSE
     2:   Jim   Jones              2    1 FALSE FALSE
     3:   Joe   Smith              1    1 FALSE FALSE
     4:   Joe   Smith              1    1 FALSE FALSE
     5:   Ann  Cotter              4    2    NA    NA
     6: Carol   Klein      Clymer  3    2    NA  TRUE
     7:   Joe   Smith              1    2  TRUE FALSE
     8:   Ann  Cotter              4    3  TRUE FALSE
     9:  Beth   Brown              5    3    NA    NA
    10:   Joe   Smith              1    4  TRUE FALSE
    11:   Joe   Smith              1    4  TRUE FALSE
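The difference between putting a constant in j and putting a column of `unmatched` in j can be seen on a toy join. This is only a sketch with made-up tables `x` and `i`, not the question's data: a constant is recycled over every row of the join result, whereas a column taken from the table being joined is NA wherever no match was found.

```r
library(data.table)

# Hypothetical toy tables, not the question's data
x <- data.table(name = c("Joe", "Carol"), id = c(1, 3), flag = TRUE)
i <- data.table(name = c("Joe", "Ann"))

# A constant in j is recycled over every row of the join result,
# including the row of i ("Ann") that has no match in x
const_flag <- x[i, .(id, flag = TRUE), on = "name"]

# A column taken from x is NA wherever i found no match
col_flag <- x[i, .(id, flag), on = "name"]
```

Here `const_flag$flag` is TRUE for both Joe and Ann, while `col_flag$flag` is TRUE for Joe and NA for Ann, which is the behaviour seen in the two versions of the loop.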
    

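    One remaining issue is that some flags that should be F have been reset to NA; it would be nice to be able to set nomatch=F, but I'm not too worried about this side effect. The key for me is knowing when each flag is T.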
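If the NA flags do matter, two possible workarounds are sketched below on toy data; neither comes from the answer itself, and the table and column names are hypothetical. The first resets the NA flags after the matching step. The second uses an update join, which only modifies the rows that find a match, so unmatched rows keep their FALSE.

```r
library(data.table)

# Option 1 (toy data): flag1 came back NA for rows the join missed,
# so reset those NAs to FALSE afterwards
flags <- data.table(name = c("Ann", "Joe", "Beth"),
                    flag1 = c(NA, TRUE, NA))
flags[is.na(flag1), flag1 := FALSE]

# Option 2 (toy data): an update join assigns only on matching rows
people <- data.table(name = c("Ann", "Joe", "Beth"),
                     id = NA_real_, flag1 = FALSE)
prior <- data.table(name = "Joe", id = 1)
people[prior, on = "name", `:=`(id = i.id, flag1 = TRUE)]
```

In the question's loop the update join would also need to be restricted to the current year, for example by joining on year as well; that part is left out of this sketch.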

    【Comments】:
