How to build a distance matrix from a user-defined function: answers

【Question title】: How to build a distance matrix from a user custom/defined function
【Posted】: 2021-03-27 15:46:14
【Question description】:

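Good afternoon!

In R, I have developed a custom function that computes the distance between mixed vectors (rows that contain both character and numeric values).

The data used is: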

data=structure(list(X126 = c("X266", "B7", "T133", "J34", "T218", 
"X249"), TVGUIDE = c("TVGUIDE", "MODMAT", "MASSEY", "KMART", 
"MASSEY", "ROSES"), YES = c("YES", "YES", "YES", "NO", "YES", 
"NO"), KEY = c("KEY", "KEY", "KEY", "KEY", "KEY", "KEY"), YES.1 = c("YES", 
"YES", "YES", "YES", "YES", "YES"), BENTON = c("BENTON", "BENTON", 
"BENTON", "BENTON", "BENTON", "BENTON"), GALLATIN = c("GALLATIN", 
"GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN"), 
    UNCOATED = c("UNCOATED", "UNCOATED", "UNCOATED", "UNCOATED", 
    "UNCOATED", "COATED"), UNCOATED.1 = c("UNCOATED", "COATED", 
    "UNCOATED", "COATED", "UNCOATED", "COATED"), NO = c("NO", 
    "NO", "NO", "NO", "NO", "NO"), LINE = c("LINE", "LINE", "LINE", 
    "LINE", "LINE", "LINE"), YES.2 = c("YES", "YES", "YES", "YES", 
    "YES", "YES"), Motter94 = c("Motter94", "WoodHoe70", "WoodHoe70", 
    "WoodHoe70", "WoodHoe70", "Motter94"), TABLOID = c("TABLOID", 
    "CATALOG", "CATALOG", "TABLOID", "CATALOG", "TABLOID"), NorthUS = c("NorthUS", 
    "NorthUS", "NorthUS", NA, "NorthUS", "CANADIAN"), band = c("noband", 
    "noband", "noband", "noband", "noband", "noband"), X25503 = c(25503L, 
    47201L, 39039L, 37351L, 38039L, 35751L), X821 = c(821L, 815L, 
    816L, 816L, 816L, 827L), X2 = c(2L, 9L, 9L, 2L, 2L, 2L), 
    X1911 = c(NA, NA, 1910L, 1910L, 1910L, 1911L), X46 = c(46L, 
    40L, 40L, 46L, 40L, 46L), X78 = c(80L, 80L, 75L, 80L, 76L, 
    75L), X20 = c(20L, 30L, 30L, 30L, 28L, 30L), X1700 = c(1900L, 
    1850L, 1467L, 2100L, 1467L, 2600L), X40 = c(40L, 40L, 40L, 
    40L, 40L, 40L), X100 = c(100L, 100L, 100L, 100L, 100L, 100L
    ), X55 = c(55, 62, 52, 50, 50, 50), X0.2 = c(0.3, 0.433, 
    0.3, 0.3, 0.267, 0.3), X17 = c(15, 16, 16, 17, 16.8, 16.5
    ), X0.75 = c(0.75, NA, 0.3125, 0.75, 0.4375, 0.75), X13.1 = c(6.6, 
    6.5, 5.6, 0, 8.6, 0), X50.5 = c(54.9, 53.8, 55.6, 57.5, 53.8, 
    62.5), X36.4 = c(38.5, 39.8, 38.8, 42.5, 37.6, 37.5), X0 = c(0, 
    0, 0, 5, 5, 6), X0.1 = c(0, 0, 0, 0, 0, 0), X2.5 = c(2.5, 
    2.8, 2.5, 2.3, 2.5, 2.5), X1 = c(0.7, 0.9, 1.3, 0.6, 0.8, 
    0.6), X34 = c(34, 40, 40, 35, 40, 30), X105 = c(105, 103.87, 
    108.06, 106.67, 103.87, 106.67)), row.names = c(NA, 6L), class = "data.frame")
    
data

The function is defined as follows (x and y are row indices):

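mixed_similarity_distance<-function(data=data,x,y){

  # Number of character columns (assumed to be the leading columns of data)
  length_charachter_part=length(which(sapply(data,class)=="character"))

  # Element-wise comparison of the character parts of rows x and y
  comparison<-c(data[x,1:length_charachter_part]==data[y,1:length_charachter_part])

  # Character distance: number of character columns that do not compare as TRUE
  char_distance=length_charachter_part-table(comparison)["TRUE"]

  # Euclidean distance between the numeric parts of the two rows
  numerical_distance=dist(rbind(data[x,-c(1:length_charachter_part)],data[y,-c(1:length_charachter_part)]))

  total_distance=numerical_distance+char_distance

  return(total_distance)

}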

Examples of computing a distance:

mixed_similarity_distance(data=data,1,1)  # output 0 

mixed_similarity_distance(data=data,2,2)  # output 0 

mixed_similarity_distance(data=data,3,1)  # distance between the first and the third rows.

I would like to compute the distance matrix over all possible pairs of rows.

I tried:

distance_matrix <- Vectorize(mixed_similarity_distance, c("x", "y"))

distance_matrix(1:nrow(data), 1:nrow(data), data)

I hope my question is clear!

Thank you for your help!

【Comments】:

Tags: r

【Solution 1】:

You can try the following, using the apply function together with expand.grid:

#Computing all distances
res <- apply(expand.grid(1:6,1:6), 1, function(x) {
  mixed_similarity_distance(data = data, x[1],x[2])})

#Convert res into a matrix
matrix(res,nrow = 6,ncol = 6,byrow = TRUE)
         [,1]      [,2]      [,3]       [,4]       [,5]      [,6]
[1,]     0.00 22712.811 13851.308 12122.0171 12829.3965 10508.754
[2,] 22712.81     0.000  8554.237 10316.7203  9599.7534 12015.549
[3,] 13851.31  8554.237     0.000  1808.8448  1001.0576  3485.796
[4,] 12122.02 10316.720  1808.845     1.0000   941.0047  1681.372
[5,] 12829.40  9599.753  1001.058   941.0047     0.0000  2561.244
[6,] 10508.75 12015.549  3485.796  1681.3721  2561.2440     0.000
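
Two details of this result are worth a closer look. First, expand.grid varies its first column fastest, so filling with byrow = TRUE places the distance for (x = j, y = i) at position [i, j]; that only matches the intended layout because the distance is symmetric. Second, the 1.0000 at [4,4] appears to come from row 4 having NA in the NorthUS column: the comparison NA == NA is NA, which table() does not count as TRUE. The following is a minimal sketch on a small made-up data frame, not the original data. The names toy and pair_distance are hypothetical, and skipping NA comparisons is an assumption about the desired behavior rather than something stated in the question.

```r
# Hypothetical toy data: two character columns followed by two numeric ones.
toy <- data.frame(
  a  = c("u", "v", "u"),
  b  = c("k", NA, "k"),
  n1 = c(1, 4, 1),
  n2 = c(0, 0, 3),
  stringsAsFactors = FALSE
)

# Hypothetical NA-tolerant variant of the row-pair distance.
# Assumption: comparisons that involve NA are skipped, not counted as mismatches.
pair_distance <- function(df, x, y) {
  is_chr <- vapply(df, is.character, logical(1))
  eq <- unlist(df[x, is_chr, drop = FALSE]) == unlist(df[y, is_chr, drop = FALSE])
  char_distance <- sum(!eq, na.rm = TRUE)
  num_distance <- as.numeric(dist(rbind(df[x, !is_chr, drop = FALSE],
                                        df[y, !is_chr, drop = FALSE])))
  char_distance + num_distance
}

# All row pairs; n is taken from the data instead of being hard-coded.
n <- nrow(toy)
pairs <- expand.grid(x = seq_len(n), y = seq_len(n))
res <- apply(pairs, 1, function(p) pair_distance(toy, p[["x"]], p[["y"]]))

# Column-major fill (the default), so dist_mat[i, j] is pair_distance(toy, i, j).
dist_mat <- matrix(res, nrow = n, ncol = n)
dist_mat
```

With this variant the diagonal is zero even for the row that contains NA. Whether NA should instead count as a mismatch depends on the application.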

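【Comments】:

The data I actually use is larger. I am looking for a solution based on nested apply calls.
expand.grid(1:n,1:n) generates a data frame of all possible combinations.
Your question says "using all possible pairs of rows...". That is exactly what expand.grid gives you.
I think this is better: sapply(1:nrow(data) ,function(y) sapply(1:nrow(data),function(x) mixed_similarity_distance(data=data,x,y))). Thank you very much for your help!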
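
The nested sapply from the last comment can be compared with an outer() variant that stays close to the Vectorize attempt in the question. In the question's call the arguments are passed positionally, so 1:nrow(data) is matched to data rather than to x. In addition, a vectorized function pairs x and y element by element, so on its own it would return only the distances for (1,1), (2,2), and so on. The sketch below uses hypothetical stand-ins (toy and toy_distance are not from the original post) and passes the data by name.

```r
# Hypothetical stand-in data and row-pair distance (not the function from the question).
toy <- data.frame(n1 = c(1, 4, 1), n2 = c(0, 0, 3))
toy_distance <- function(data, x, y) {
  sqrt(sum((unlist(data[x, ]) - unlist(data[y, ]))^2))
}

idx <- seq_len(nrow(toy))

# Nested sapply: the inner loop over x fills the rows, the outer loop over y the columns.
m_sapply <- sapply(idx, function(y) sapply(idx, function(x) toy_distance(toy, x, y)))

# outer() calls its function once with all index pairs expanded into two vectors,
# so the function must be vectorized over x and y; data is passed by name.
v_distance <- Vectorize(toy_distance, vectorize.args = c("x", "y"))
m_outer <- outer(idx, idx, function(x, y) v_distance(data = toy, x = x, y = y))

# Both approaches should give the same matrix.
all.equal(m_sapply, m_outer, check.attributes = FALSE)
```

Both versions still call the distance function once per pair, so neither is faster in principle; the choice is mostly a matter of readability.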