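【Posted】: 2021-01-16 11:36:37
【Question】:
I have two sample data frames, df1 and df2, shown below.
df1 holds a list of selected tennis fixtures, with the player names (player1_name, player2_name) and the match date. Full player names are used here.
df2 holds the full list of tennis match results for each date (winner, loser). Here, names are given as the first-name initial followed by the full last name.
The player names for fixtures and results were scraped from different websites, so in some cases the last names may not match exactly.
With that in mind, I want to add a new column to df1 indicating whether player1 or player2 won. Basically, I want to map player1_name and player2_name in df1 to the winner and loser in df2 using some form of partial matching, restricted to the same date.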
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere",
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen",
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson",
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas",
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")
dput(df2)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin",
"A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis",
"M Giron", "A Rublev", "N Djokovic", "R Carballes Baena",
"A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov",
"G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin",
"V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot",
"G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey",
"M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili",
"J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
)), row.names = c(NA, -20L), class = "data.frame")
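To make the naming mismatch described above concrete, here is a minimal sketch that abbreviates a few of the df1 names to the "initial plus last name" form used in df2 and checks which ones then match exactly. The helper `abbreviate_name` is hypothetical (it is not part of the question), and it assumes a single-word first name.

```r
# Hypothetical helper (not from the question): "Laslo Djere" -> "L Djere".
# Assumes the first name is a single word; everything after it is kept as-is.
abbreviate_name <- function(full_name) {
  sub("^([^ ])[^ ]* +", "\\1 ", full_name)
}

# A few names taken from df1 (fixtures) and df2 (results)
fixture_names <- c("Laslo Djere", "Roberto Carballes", "Nikoloz Basilashvili")
result_names  <- c("L Djere", "R Carballes Baena", "N Basilashvili")

short_names <- abbreviate_name(fixture_names)
exact_hit   <- short_names %in% result_names

print(data.frame(fixture_names, short_names, exact_hit))
```

"R Carballes" does not equal "R Carballes Baena", so an exact join on the abbreviated names would miss that fixture, which is why some form of approximate matching is still needed.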
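I have written a custom function that uses the RecordLinkage package to match a string to its closest match in a vector of strings. I could write some extremely inefficient code with this function, but before going down that path I would like to see whether there is a more efficient way to do this.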
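library(RecordLinkage)  # provides levenshteinSim()

# For each element of `string`, return the closest entry of `stringVector`,
# or NA when the best similarity is below `max_threshold`.
ClosestMatch <- function(string, stringVector, max_threshold = 0.5) {
  matches <- character(length(string))
  for (i in seq_along(string)) {
    # levenshteinSim() returns a similarity in [0, 1], not a distance
    similarity <- levenshteinSim(string[i], stringVector)
    if (max(similarity) >= max_threshold) {
      matches[i] <- stringVector[which.max(similarity)]
    } else {
      matches[i] <- NA_character_
    }
  }
  return(matches)
}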
【Comments】:
-
Take a look at ?adist
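As a rough illustration of how the adist suggestion could be applied, the sketch below scores each fixture against the same-day results in both orientations (player1 as winner, player2 as winner) and keeps the closer one. It uses a toy subset of the question's data; the helpers `to_initial`, `norm_dist`, and `who_won`, as well as the 0.3 cutoff, are assumptions made for illustration and not something the commenter or the question specified.

```r
# Toy subset of the question's data (same date value as in the dput output)
match_day <- as.Date(18534, origin = "1970-01-01")

fixtures <- data.frame(
  date = match_day,
  player1_name = c("Laslo Djere", "Steve Johnson", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Roberto Carballes", "Nikoloz Basilashvili"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  date = match_day,
  winner = c("K Anderson", "R Carballes Baena", "T Monteiro", "N Djokovic"),
  loser  = c("L Djere", "S Johnson", "N Basilashvili", "M Ymer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "Laslo Djere" -> "L Djere" (single-word first name assumed)
to_initial <- function(x) sub("^([^ ])[^ ]* +", "\\1 ", x)

# Hypothetical helper: edit distance scaled to [0, 1] by the longer string
norm_dist <- function(name, candidates) {
  d <- drop(adist(name, candidates, ignore.case = TRUE))
  d / pmax(nchar(name), nchar(candidates))
}

# Hypothetical helper: returns "player1", "player2", or NA when no
# same-day result is close enough (max_dist is an arbitrary cutoff)
who_won <- function(date, p1, p2, results, max_dist = 0.3) {
  day <- results[results$date == date, ]
  if (nrow(day) == 0) return(NA_character_)
  a <- to_initial(p1)
  b <- to_initial(p2)
  # Mean distance per result row, for both possible orientations
  p1_wins <- (norm_dist(a, day$winner) + norm_dist(b, day$loser)) / 2
  p2_wins <- (norm_dist(b, day$winner) + norm_dist(a, day$loser)) / 2
  if (min(p1_wins, p2_wins) > max_dist) return(NA_character_)
  if (min(p1_wins) <= min(p2_wins)) "player1" else "player2"
}

fixtures$winner <- vapply(
  seq_len(nrow(fixtures)),
  function(i) who_won(fixtures$date[i], fixtures$player1_name[i],
                      fixtures$player2_name[i], results),
  character(1)
)
print(fixtures)
```

Scoring both players of a fixture against the same result row makes a spurious match less likely than matching each name on its own, since a wrong row would have to resemble both names at once. The cutoff would need tuning on real data.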
Tags: r tidyverse fuzzy-comparison data-wrangling fuzzyjoin