【Question title】: k Nearest Neighbour in MySQL
【Posted】: 2025-04-22 11:15:01
【Question】:

I have the following table in MySQL:

DATE    EDGE    VALUE
D       E1      X1
D       E2      Y1
D       E3      Z1

D1      E1      X2
D1      E2      Y2
D1      E3      Z2

D2      E1      X3
D2      E2      Y3
D2      E3      Z3

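Now I want to compute the Euclidean distance from D to D1 and from D to D2:

Distance(D-D1) = Sqrt((X1-X2)^2 + (Y1-Y2)^2 + (Z1-Z2)^2);
Distance(D-D2) = Sqrt((X1-X3)^2 + (Y1-Y3)^2 + (Z1-Z3)^2);
... and so on.

From these distances I want to select the "k" nearest neighbours of D. (Note that record D may have any number of edge entries (E1, E2 ... En). In that case the other records D1, D2, D3 will have the same number of edge entries.)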
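For a record with n edge entries, the two distances written out above can be stated in one general form. The notation is introduced here only for illustration and is not part of the original post: v_i(D) stands for the VALUE of date D on edge E_i.

```latex
\[
  \operatorname{dist}(D, D_j)
    = \sqrt{\sum_{i=1}^{n} \bigl( v_i(D) - v_i(D_j) \bigr)^{2}}
\]
```

Because the square root is monotonically increasing, ranking the candidates by the sum of squares alone yields the same k nearest neighbours as ranking by the full distance.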

Please suggest a solution as a stored procedure in MySQL...

Thanks in advance


@eggyal:

I tried to build a query similar to the one in your answer.

Query:

SELECT   b.id,SQRT(SUM(POW(a.score - b.score, 2))) score1
FROM     (select * from data d1 where  d1.id = (select max(t1.id) from Timestamp t1) 
and d1.edge_id in (select m1.src_edge from mapping m1
where m1.dst = (select m2.src from mapping m2 where m2.src_edge=2 limit 1))) a
JOIN (select * from data d2 where d2.id in ( select t2.id from Timestamp t2 where DAYOFWEEK(NOW())=DAYOFWEEK(t2.timestamp)) and d2.edge_id in (select m3.src_edge from mapping m3 
where m3.dst = (select m4.src from mapping m4 
where m4.src_edge=2 limit 1))) as b
ON b.id <> a.id AND b.edge_id = a.edge_id 
GROUP BY b.id
ORDER BY score1
LIMIT    5;

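However, this is not a well-designed query. Please suggest all the improvements it needs.

Thanks in advance

Tables used in the query above:

Data table: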

CREATE TABLE `data` (
  `id` bigint(20) DEFAULT NULL,
  `edge_id` bigint(20) NOT NULL,
  `score` int(11) NOT NULL,
  KEY `edge_id` (`edge_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (edge_id)
(PARTITION p0 VALUES LESS THAN (10000) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (20000) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (30000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (40000) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN (50000) ENGINE = InnoDB,
 PARTITION p5 VALUES LESS THAN (60000) ENGINE = InnoDB,
 PARTITION p6 VALUES LESS THAN (70000) ENGINE = InnoDB,
 PARTITION p7 VALUES LESS THAN (80000) ENGINE = InnoDB) */$$

Mapping table:

CREATE TABLE `mapping` (
  `dst` bigint(20) NOT NULL DEFAULT '0',
  `src` bigint(20) DEFAULT NULL,
  `src_edge` bigint(20) NOT NULL DEFAULT '0',
  PRIMARY KEY (`dst`,`src_edge`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1$$

Timestamp table:

CREATE TABLE `Timestamp` (
  `timestamp` datetime NOT NULL,
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `time` (`timestamp`,`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1$$

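【Comments on the question】:

  • Why do you need a stored procedure? Why not simply perform a self-join and order by the computed distance?
  • @eggyal: Because I have to do some preprocessing before this query runs. You may suggest a non-stored-procedure solution for the same.
  • Didn't I just suggest a "non-stored-procedure solution for the same"?
  • @eggyal: I want the exact SQL query that solves the above problem..
  • Consider providing proper DDL (and/or an sqlfiddle) together with the desired result set.

Tags: mysql stored-procedures nearest-neighbor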


【Solution 1】:

As mentioned in my comment above, why not simply perform a self-join and order by the computed distance?

SELECT   b.date
FROM     my_table a
    JOIN my_table b ON b.date <> a.date AND b.edge = a.edge
WHERE    a.date = ?
GROUP BY b.date
ORDER BY SUM(POW(a.value - b.value, 2))
LIMIT    ?

See it on sqlfiddle.
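To try the self-join without a MySQL server, here is a minimal sketch that runs essentially the same query against an in-memory SQLite database from Python. The sample values, the extra date `D3` and the helper name `k_nearest` are made up for illustration. The squared difference is written as a plain multiplication because `POW` is not guaranteed to be available in every SQLite build, and `b.date` is added to the `ORDER BY` only to make ties deterministic.

```python
import math
import sqlite3

# Hypothetical sample data: three edges per date, values chosen for illustration only.
ROWS = [
    ("D",  "E1", 5),  ("D",  "E2", 10), ("D",  "E3", 15),
    ("D1", "E1", 20), ("D1", "E2", 25), ("D1", "E3", 30),
    ("D2", "E1", 6),  ("D2", "E2", 10), ("D2", "E3", 17),
    ("D3", "E1", 0),  ("D3", "E2", 10), ("D3", "E3", 15),
]


def k_nearest(conn, target, k):
    """Return [(date, distance)] for the k dates closest to `target`."""
    cur = conn.execute(
        """
        SELECT   b.date,
                 SUM((a.value - b.value) * (a.value - b.value)) AS sq_dist
        FROM     my_table a
            JOIN my_table b ON b.date <> a.date AND b.edge = a.edge
        WHERE    a.date = ?
        GROUP BY b.date
        ORDER BY sq_dist, b.date
        LIMIT    ?
        """,
        (target, k),
    )
    # The square root is taken in Python; it does not change the ordering.
    return [(date, math.sqrt(sq)) for date, sq in cur.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date TEXT, edge TEXT, value INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", ROWS)

neighbours = k_nearest(conn, "D", 2)
for date, dist in neighbours:
    print(date, round(dist, 2))
```

With this sample data the two nearest neighbours of `D` are `D2` and then `D3`; `D1` is the farthest and is cut off by the `LIMIT`.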

【Discussion】:

  • Please see the answer below and suggest the required modifications.
【Solution 2】:

Perhaps this will get you started...

CREATE TABLE nodes
(node_id CHAR(2) NOT NULL,   plane CHAR(1) NOT NULL, value INT NOT NULL, PRIMARY KEY(node_id,plane));

INSERT INTO nodes VALUES
('D','x',5),
('D','y',10),
('D','z',15),
('D1','x',20),
('D1','y',25),
('D1','z',30);

CREATE VIEW v_nodes AS 
SELECT node_id
    , MAX(CASE WHEN plane = 'x' THEN value END) x
    , MAX(CASE WHEN plane = 'y' THEN value END) y
    , MAX(CASE WHEN plane = 'z' THEN value END) z
 FROM nodes 
GROUP
   BY node_id;

SELECT ROUND(SQRT
                ( POW(d.x - d1.x, 2)
                + POW(d.y - d1.y, 2)
                + POW(d.z - d1.z, 2)
                )
           ,2) delta
  FROM v_nodes d
  JOIN v_nodes d1
 WHERE d.node_id = 'D'
   AND d1.node_id = 'D1';

+-------+
| delta |
+-------+
| 25.98 |
+-------+   
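The 25.98 above can be checked by hand: each coordinate differs by 15, so the sum of squares is 3 × 15² = 675 and its square root is about 25.98. A small Python sketch of the same arithmetic, using the values inserted into `nodes` above (the dictionary names are just for illustration):

```python
import math

# Coordinates of the two nodes, copied from the INSERT statement above.
d = {"x": 5, "y": 10, "z": 15}
d1 = {"x": 20, "y": 25, "z": 30}

# Sum of squared per-plane differences, then the rounded Euclidean distance.
squared = sum((d[p] - d1[p]) ** 2 for p in d)
delta = round(math.sqrt(squared), 2)
print(squared, delta)
```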

【Discussion】:

【Solution 3】:

Here is the final solution:

SELECT   d2.id, SQRT(SUM(POW(d1.score - d2.score, 2))) score1
FROM     data d1
    JOIN data d2 ON d2.id <> d1.id AND d2.edge_id = d1.edge_id
    JOIN mapping m1 ON m1.src_edge = d1.edge_id
    JOIN mapping m2 ON m2.src = m1.dst
    JOIN (SELECT MAX(id) AS id FROM Timestamp) t1 ON t1.id = d1.id
    JOIN Timestamp t2 ON t2.id = d2.id
WHERE    m2.src_edge = 2
     AND DAYOFWEEK(NOW()) = DAYOFWEEK(t2.timestamp)
GROUP BY d2.id
ORDER BY score1
LIMIT    5;
    

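This solution was given by eggyal.

I completely agree with you that I should learn joins. Whatever solution I came up with was the best I could do. I will certainly learn joins soon. Thanks for your great help...

【Discussion】:

  • @eggyal: Thanks for the solution. I will learn JOINs as soon as possible..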