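[Posted]: 2021-03-27 15:46:14
[Question]:
Good afternoon!
In R, I have written a custom function that computes the distance between two mixed-type vectors (character and numeric values).
The data used are: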
data=structure(list(X126 = c("X266", "B7", "T133", "J34", "T218",
"X249"), TVGUIDE = c("TVGUIDE", "MODMAT", "MASSEY", "KMART",
"MASSEY", "ROSES"), YES = c("YES", "YES", "YES", "NO", "YES",
"NO"), KEY = c("KEY", "KEY", "KEY", "KEY", "KEY", "KEY"), YES.1 = c("YES",
"YES", "YES", "YES", "YES", "YES"), BENTON = c("BENTON", "BENTON",
"BENTON", "BENTON", "BENTON", "BENTON"), GALLATIN = c("GALLATIN",
"GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN"),
UNCOATED = c("UNCOATED", "UNCOATED", "UNCOATED", "UNCOATED",
"UNCOATED", "COATED"), UNCOATED.1 = c("UNCOATED", "COATED",
"UNCOATED", "COATED", "UNCOATED", "COATED"), NO = c("NO",
"NO", "NO", "NO", "NO", "NO"), LINE = c("LINE", "LINE", "LINE",
"LINE", "LINE", "LINE"), YES.2 = c("YES", "YES", "YES", "YES",
"YES", "YES"), Motter94 = c("Motter94", "WoodHoe70", "WoodHoe70",
"WoodHoe70", "WoodHoe70", "Motter94"), TABLOID = c("TABLOID",
"CATALOG", "CATALOG", "TABLOID", "CATALOG", "TABLOID"), NorthUS = c("NorthUS",
"NorthUS", "NorthUS", NA, "NorthUS", "CANADIAN"), band = c("noband",
"noband", "noband", "noband", "noband", "noband"), X25503 = c(25503L,
47201L, 39039L, 37351L, 38039L, 35751L), X821 = c(821L, 815L,
816L, 816L, 816L, 827L), X2 = c(2L, 9L, 9L, 2L, 2L, 2L),
X1911 = c(NA, NA, 1910L, 1910L, 1910L, 1911L), X46 = c(46L,
40L, 40L, 46L, 40L, 46L), X78 = c(80L, 80L, 75L, 80L, 76L,
75L), X20 = c(20L, 30L, 30L, 30L, 28L, 30L), X1700 = c(1900L,
1850L, 1467L, 2100L, 1467L, 2600L), X40 = c(40L, 40L, 40L,
40L, 40L, 40L), X100 = c(100L, 100L, 100L, 100L, 100L, 100L
), X55 = c(55, 62, 52, 50, 50, 50), X0.2 = c(0.3, 0.433,
0.3, 0.3, 0.267, 0.3), X17 = c(15, 16, 16, 17, 16.8, 16.5
), X0.75 = c(0.75, NA, 0.3125, 0.75, 0.4375, 0.75), X13.1 = c(6.6,
6.5, 5.6, 0, 8.6, 0), X50.5 = c(54.9, 53.8, 55.6, 57.5, 53.8,
62.5), X36.4 = c(38.5, 39.8, 38.8, 42.5, 37.6, 37.5), X0 = c(0,
0, 0, 5, 5, 6), X0.1 = c(0, 0, 0, 0, 0, 0), X2.5 = c(2.5,
2.8, 2.5, 2.3, 2.5, 2.5), X1 = c(0.7, 0.9, 1.3, 0.6, 0.8,
0.6), X34 = c(34, 40, 40, 35, 40, 30), X105 = c(105, 103.87,
108.06, 106.67, 103.87, 106.67)), row.names = c(NA, 6L), class = "data.frame")
data
The function is defined as follows (x and y are row indices):
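mixed_similarity_distance <- function(data, x, y) {
  # Number of character columns (assumes they all come before the numeric columns)
  length_character_part <- sum(sapply(data, is.character))
  char_cols <- seq_len(length_character_part)
  # Column-wise equality of the character parts of rows x and y
  comparison <- c(data[x, char_cols] == data[y, char_cols])
  # Columns minus matches; sum() gives 0 when nothing matches, where table(...)["TRUE"] gave NA
  char_distance <- length_character_part - sum(comparison, na.rm = TRUE)
  # Euclidean distance between the numeric parts, returned as a plain number
  numerical_distance <- as.numeric(dist(rbind(data[x, -char_cols], data[y, -char_cols])))
  total_distance <- numerical_distance + char_distance
  return(total_distance)
}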
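To make the two components of the distance concrete, here is a minimal sketch on a made-up two-row data frame. The name `toy` and its columns are hypothetical and not part of the data above; the sketch only illustrates the idea of "number of differing character columns plus Euclidean distance of the numeric columns".

```r
# Hypothetical toy data: two character columns followed by two numeric columns.
toy <- data.frame(
  brand = c("A", "B"),
  coat  = c("YES", "YES"),
  width = c(1, 4),
  speed = c(2, 6),
  stringsAsFactors = FALSE
)

is_char <- sapply(toy, is.character)

# Character part: number of character columns whose values differ between rows 1 and 2.
char_part <- sum(toy[1, is_char] != toy[2, is_char])

# Numeric part: Euclidean distance between the numeric values of the two rows.
num_part <- as.numeric(dist(toy[c(1, 2), !is_char]))

# Total mixed distance for this pair of rows.
char_part + num_part
```

Here only `brand` differs, so the character part is 1, and the numeric part is sqrt(3^2 + 4^2) = 5.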
Example distance computations:
mixed_similarity_distance(data=data,1,1) # output 0
mixed_similarity_distance(data=data,2,2) # output 0
mixed_similarity_distance(data=data,3,1) # distance between the first and the third rows.
I would like to compute the distance matrix over all possible pairs of rows.
I tried:
distance_matrix <- Vectorize(mixed_similarity_distance, c("x", "y"))
distance_matrix(1:nrow(data), 1:nrow(data), data)
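Two things seem worth noting about the attempt above. First, the arguments are passed positionally, so `1:nrow(data)` is matched to the `data` parameter rather than to `x`. Second, a function wrapped with `Vectorize()` walks `x` and `y` in parallel (pairs (1,1), (2,2), ...) and does not expand to all combinations by itself. One possible approach is to let `outer()` generate every pair of indices. The sketch below is self-contained and uses a hypothetical `toy` data frame and a hypothetical stand-in function `toy_distance` with the same `(data, x, y)` signature; it is an illustration under those assumptions, not a tested solution for the data in the question.

```r
# Hypothetical toy data: one character column, two numeric columns.
toy <- data.frame(
  brand = c("A", "B", "A"),
  width = c(1, 4, 1),
  speed = c(2, 6, 2),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in with the same (data, x, y) signature as the function in the question.
toy_distance <- function(data, x, y) {
  is_char <- sapply(data, is.character)
  # Number of character columns that differ between rows x and y
  char_part <- sum(data[x, is_char, drop = FALSE] != data[y, is_char, drop = FALSE],
                   na.rm = TRUE)
  # Euclidean distance between the numeric parts of rows x and y
  num_part <- as.numeric(dist(data[c(x, y), !is_char, drop = FALSE]))
  char_part + num_part
}

n <- nrow(toy)

# outer() hands whole index vectors to its function, so the scalar function is
# wrapped with Vectorize(); the wrapper closes over `toy`, so only x and y vary.
pair_distance <- Vectorize(function(x, y) toy_distance(toy, x, y))

distance_matrix <- outer(seq_len(n), seq_len(n), pair_distance)
distance_matrix
```

With the real data, the same pattern would be `Vectorize(function(x, y) mixed_similarity_distance(data, x, y))` passed to `outer()`. This computes every pair twice; for large data it may be cheaper to loop over `combn(n, 2)` and fill a symmetric matrix.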
I hope my question is clear!
Thanks for your help!
[Discussion]:
Tags: r