【Title】: MySQL / PHP: Find similar / related items by tag / taxonomy
【Posted】: 2013-08-03 21:44:45
【Question】:

I have a cities table like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table like this.

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

And a city_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |

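How do I calculate which cities are most closely related? For example, if I am looking at city 1 (Paris), the results should be: London (2), New York (3).

I found the Jaccard index, but I'm not sure how best to implement it.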

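【Comments】:

Why not start with something simple, like the total number of tags that match between cities, and then find the closest cities based on the number of matching tags?
Is this all the data you have? Are you allowed to add more columns to the database, such as latitude/longitude values? Or would you prefer a server-side/client-side API call every time you want to know this?
You could have a look at this: stackoverflow.com/questions/2706499/…
How do you define "closely related"? 1/(number of tags in common)?
@Tom See my updated Jaccard similarity fiddle: sqlfiddle.com/#!2/e7456/1

Tags: php mysql relationship tagging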

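【Answer 1】:

It's late, but I don't think any of the answers is completely right. I took the best part of each one and put them together to make my own answer:

@m-khalid-junaid's explanation of the Jaccard index is very interesting and correct, but the implementation of (q.sets + q.parisset) AS union and (q.sets - q.parisset) AS intersect is very wrong.
@n-lx's version is the way to go, but it needs the Jaccard index. This matters: if a city has 2 tags and matches two tags of another city that has 3 tags, the result would be the same as a match with a city that has only those same two tags. I think the full match should rank as the most related.

My answer:

A cities table like this.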

| id | Name      |
| 1  | Paris     |
| 2  | Florence  |
| 3  | New York  |
| 4  | São Paulo |
| 5  | London    |

A cities_tags table like this.

| city_id | tag_id |
| 1       | 1      | 
| 1       | 3      | 
| 2       | 1      |
| 2       | 3      | 
| 3       | 1      |     
| 3       | 2      |
| 4       | 2      |     
| 5       | 1      |
| 5       | 2      |
| 5       | 3      |

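With this sample data, Florence is a full match with Paris, New York matches one tag, São Paulo matches no tags, and London matches two tags and has one more. I think the Jaccard index for this sample is:

Florence: 1.000 (2/2)

London: 0.666 (2/3)

New York: 0.333 (1/3)

São Paulo: 0.000 (0/3)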
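
The figures above can be checked outside the database. The following is only a minimal Python sketch, assuming the tag sets are copied by hand from the sample cities_tags table; the names `city_tags`, `jaccard` and `rank_related` are made up for illustration and are not part of the answer's query.

```python
# Tag sets copied from the sample cities_tags table in this answer.
city_tags = {
    "Paris": {1, 3},
    "Florence": {1, 3},
    "New York": {1, 2},
    "São Paulo": {2},
    "London": {1, 2, 3},
}


def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_related(target, tags_by_city):
    """Rank every other city by its Jaccard index against `target`, best first."""
    scores = [
        (name, jaccard(tags_by_city[target], tags))
        for name, tags in tags_by_city.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for name, score in rank_related("Paris", city_tags):
    print(f"{name}: {score:.3f}")
```

The ranking comes out as Florence, London, New York, São Paulo, which is the order the list above expects.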

My query looks like this:

select jaccard.city,
       jaccard.`intersect`,
       jaccard.`union`,
       jaccard.`intersect` / jaccard.`union` as `jaccard index`
from
(select
    c2.name as city
    -- tags of c2 that c1 also has (unmatched rows are NULL and are not counted)
    ,count(ct2.tag_id) as `intersect`
    -- distinct tags carried by either of the two cities
    ,(select count(distinct ct3.tag_id)
      from cities_tags ct3
      where ct3.city_id in (c1.id, c2.id)) as `union`
from
    cities as c1
    inner join cities as c2 on c1.id != c2.id
    left join cities_tags as ct1 on ct1.city_id = c1.id
    left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
where c1.id = 1 -- the city to compare against (Paris)
group by c1.id, c2.id) as jaccard
order by jaccard.`intersect` / jaccard.`union` desc

SQL Fiddle
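
For anyone who wants to try the shape of this query without a MySQL server, here is a rough sketch that runs an adapted version against an in-memory SQLite database from Python. The adaptations are assumptions on my part, not the answer's code: SQLite stands in for MySQL, the aliases are renamed to `shared` and `total` to stay clear of the reserved words, the city id is passed as a bound parameter, and `1.0 *` forces a non-integer division because SQLite divides integers as integers.

```python
import sqlite3

# In-memory database loaded with the sample data from this answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES
        (1, 'Paris'), (2, 'Florence'), (3, 'New York'),
        (4, 'São Paulo'), (5, 'London');
    INSERT INTO cities_tags VALUES
        (1, 1), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2),
        (4, 2), (5, 1), (5, 2), (5, 3);
""")

# Same join structure as the answer's query; `?` is the city to compare against.
QUERY = """
SELECT j.city, j.shared, j.total, 1.0 * j.shared / j.total AS jaccard_index
FROM (
    SELECT c2.name AS city,
           COUNT(ct2.tag_id) AS shared,
           (SELECT COUNT(DISTINCT ct3.tag_id)
              FROM cities_tags AS ct3
             WHERE ct3.city_id IN (c1.id, c2.id)) AS total
    FROM cities AS c1
    JOIN cities AS c2 ON c1.id != c2.id
    LEFT JOIN cities_tags AS ct1 ON ct1.city_id = c1.id
    LEFT JOIN cities_tags AS ct2
           ON ct2.city_id = c2.id AND ct2.tag_id = ct1.tag_id
    WHERE c1.id = ?
    GROUP BY c1.id, c2.id
) AS j
ORDER BY jaccard_index DESC
"""

rows = conn.execute(QUERY, (1,)).fetchall()
for city, shared, total, index in rows:
    print(f"{city}: {shared}/{total} = {index:.3f}")
```

Binding the city id as a parameter also avoids pasting user input into the SQL string, which is worth keeping when porting this to PHP.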

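【Comments】:

【Answer 2】:

Your question is: how do I calculate which cities are most closely related? For example, if I am looking at city 1 (Paris), the results should be: London (2), New York (3). Based on the data set you provided, there is only one thing to relate on, namely the common tags between cities, so the cities that share common tags are the closest ones. Below is a subquery that finds the cities that share common tags (excluding the city whose closest cities we are looking for):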

SELECT * FROM `cities`  WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

How it works

I assume you will input one of the city ids or names to find its closest cities; in my case "Paris" has the id 1.

 SELECT tag_id FROM `cities_tags` WHERE city_id=1

It finds all of Paris's tag ids. Then

SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1

fetches all the cities, except Paris, that have some of the same tags as Paris.

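Here is your Fiddle

While reading about Jaccard similarity/index I found some material that explains what the terms actually mean. Let's take this example, where we have two sets A and B:

Set A = {A, B, C, D, E}

Set B = {I, H, G, F, E, D}

The Jaccard similarity formula is JS = (A intersect B) / (A union B)

A intersect B = {D, E} = 2

A union B = {A, B, C, D, E, I, H, G, F} = 9

JS = 2/9 = 0.2222222222222222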
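
The arithmetic of this example can be reproduced with built-in set operations. This is just an illustrative Python sketch; the variable names are mine.

```python
set_a = {"A", "B", "C", "D", "E"}
set_b = {"I", "H", "G", "F", "E", "D"}

intersection = set_a & set_b   # elements present in both sets
union = set_a | set_b          # elements present in at least one set

# Jaccard similarity: size of the intersection over size of the union.
jaccard_similarity = len(intersection) / len(union)
print(sorted(intersection), len(union), jaccard_similarity)
```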

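Now turning to your scenario

Paris has tag_ids 1,3, so we make a set from these and call it Set P = {Europe, River}

London has tag_ids 1,3, so we make a set from these and call it Set L = {Europe, River}

New York has tag_ids 2,3, so we make a set from these and call it Set NW = {North America, River}

Calculating the JS of Paris and London: JSPL = P intersect L / P union L, JSPL = 2/2 = 1

Calculating the JS of Paris and New York: JSPNW = P intersect NW / P union NW, JSPNW = 1/3 = 0.3333333333

So far, here is the query that calculates the perfect Jaccard index; you can see the fiddle example below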

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

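In the query above I derived the result set through two subselects in order to get my custom calculated aliases.

You can add a filter to the query above so that it does not calculate the similarity of the city with itself: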

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE  cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

So the results show that Paris is closely related to London, and then to New York.

Jaccard Similarity Fiddle

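【Comments】:

Interesting implementation. I'm curious how this would scale compared with my implementation on a large data set.
@TheGunner Some caching should be useful, since your tags won't change often.
@TheGunner If you look at the real-life scenario and a map, this is based on cities and countries, so the data set will not get that large and there is no reason to worry about performance. There are plenty of solutions available for optimization in an RDBMS, and the OP should focus on caching, proper indexes, relations, and column data types.
You would have to add LIMIT in the topmost parent of all the derived tables, e.g. LIMIT 10 at the end of the query.
So I'm pretty sure this solution cannot work. I made a fiddle with different data: sqlfiddle.com/#!2/ad2a9/1. I changed the tag_ids by changing 1 to 2, 3 to 8, and 2 to 5. The results are different, even though they should be identical (since the city-tag relationships stay the same). This is because in q.sets - q.parisset and q.sets + q.parisset, sets and parisset are cast to integers (so only the part before the first comma is kept: 2, 2 and 5, 8). The fact that the original fiddle works is a coincidence. This is not a valid answer.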
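
To make the last comment concrete, here is a small Python sketch that repeats the query's arithmetic on GROUP_CONCAT-style strings. It is only an approximation under stated assumptions: `mysql_numeric_prefix` is a hypothetical helper that mimics MySQL's string-to-number cast by keeping the leading integer, and the strings assume GROUP_CONCAT returned the tag ids in ascending order.

```python
import re


def mysql_numeric_prefix(text):
    """Roughly mimic MySQL's string-to-number cast: keep the leading integer, else 0."""
    match = re.match(r"\s*[+-]?\d+", text)
    return int(match.group()) if match else 0


def answer_query_score(city_set, paris_set):
    """Repeat the arithmetic of the query above on comma-separated id strings."""
    sets = mysql_numeric_prefix(city_set)
    parisset = mysql_numeric_prefix(paris_set)
    union = sets + parisset        # not a set union, just an integer sum
    intersect = sets - parisset    # not a set intersection, just a difference
    return (union if intersect == 0 else intersect) / union


# Original tag ids: Paris "1,3", London "1,3", New York "2,3".
print(answer_query_score("1,3", "1,3"), answer_query_score("2,3", "1,3"))
# Renumbered tag ids (1 -> 2, 3 -> 8, 2 -> 5): same relationships, different ids.
print(answer_query_score("2,8", "2,8"), answer_query_score("5,8", "2,8"))
```

With the original ids New York scores 1/3, which happens to equal the true Jaccard index; after renumbering it scores 3/7 although the city-tag relationships are unchanged.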

【Answer 3】:

select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c 
inner join 
  (
  select city_id, count(*) as val 
  from cities_tags 
  where tag_id in (select tag_id from cities_tags where city_id=1) 
  and not city_id in (1)
  group by city_id
  ) as cnt 
on c.id=cnt.city_id
order by jaccard_index desc

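This query statically references city_id=1, so you would have to make it a variable in both the where tag_id in clause and the not city_id in clause.

If I understand the Jaccard index correctly, this query also returns that value, ordered by "most closely related". The results for our example look like this: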

|name      |jaccard_index  |
|London    |0.6667         |
|New York  |0.3333         |

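Edit

With a better understanding of how to implement the Jaccard index:

After reading a bit more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for our example data set. Essentially, we compare our chosen city with each other city in the list independently, and divide the number of common tags by the number of distinct tags across the two cities.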

select c.name, 
  case -- when this city's tags are a subset of the chosen city's tags
    when not_in.cnt is null 
  then -- then the union count is the chosen city's tag count
    intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
  else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
    intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
  end as jaccard_index
  -- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c 
inner join 
  (
    --  select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt 
    from cities_tags 
    where tag_id in (select tag_id from cities_tags where city_id=1) 
    and city_id!=1
    group by city_id
  ) as intersection
on c.id=intersection.city_id
left join
  (
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
  ) as not_in
on c.id=not_in.city_id
order by jaccard_index desc

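The query is a bit long, and I don't know how well it will scale, but it does implement a true Jaccard index, as requested in the question. Here are the results of the new query: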

+----------+---------------+
| name     | jaccard_index |
+----------+---------------+
| London   |        1.0000 |
| New York |        0.3333 |
+----------+---------------+
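
The query builds the union size as the chosen city's tag count plus the other city's tags that the chosen city lacks. A short Python sketch of the same counting, assuming the tag sets are copied from the question's sample data; `jaccard_via_counts` is an illustrative name, not something from the query.

```python
# Tag sets from the question's sample data (1 Europe, 2 North America, 3 River).
tags = {"Paris": {1, 3}, "London": {1, 3}, "New York": {2, 3}}


def jaccard_via_counts(chosen, other):
    """Jaccard index built from the same two counts the SQL query uses."""
    intersection_cnt = len(other & chosen)   # the `intersection` derived table
    not_in_cnt = len(other - chosen)         # the `not_in` derived table (0 when it has no row)
    # |chosen ∪ other| = |chosen| + |other \ chosen|
    return intersection_cnt / (not_in_cnt + len(chosen))


for city in ("London", "New York"):
    print(city, round(jaccard_via_counts(tags["Paris"], tags[city]), 4))
```

One difference to keep in mind: the inner join on `intersection` in the query drops cities that share no tag at all, while this sketch would simply give them an index of 0.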

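Edited again to add comments to the query, and to account for the case where the current city's tags are a subset of the chosen city's tags

【Comments】:

My Jaccard index is not correct. I will edit with a proper implementation sometime today.
Please see the new query, which implements a true Jaccard index.

【Answer 4】:

This query has no fancy functions, not even subqueries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id have an index.

The query returns a result containing: city1, city2, and the count of how many tags city1 and city2 have in common.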

select
    c1.name as city1
    ,c2.name as city2
    ,count(ct2.tag_id) as match_count
from
    cities as c1
    inner join cities as c2 on
        c1.id != c2.id              -- change != into > if you dont want duplicates
    left join cities_tags as ct1 on -- use inner join to filter cities with no match
        ct1.city_id = c1.id
    left join cities_tags as ct2 on -- use inner join to filter cities with no match
        ct2.city_id = c2.id
        and ct1.tag_id = ct2.tag_id
group by
    c1.id
    ,c2.id
order by
    c1.id
    ,match_count desc
    ,c2.id

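Change != into > to avoid each pair of cities being returned twice. That means a pair will no longer appear once with a city in the first column and once with it in the second column.

If you don't want to see city combinations that have no matching tags, change the two left joins into inner joins.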
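
As a rough cross-check of what the query counts, here is a Python sketch that produces one row per unordered pair of cities (the `>` variant) from the question's sample data. The `skip_zero` flag is an assumption of mine that plays the role of switching the left joins to inner joins; row order follows the dictionary rather than the query's order by.

```python
from itertools import combinations

# Tag sets from the question's sample data.
tags = {"Paris": {1, 3}, "London": {1, 3}, "New York": {2, 3}}


def match_counts(tags_by_city, skip_zero=False):
    """Return (city1, city2, shared tag count) for each unordered pair of cities."""
    rows = [
        (a, b, len(tags_by_city[a] & tags_by_city[b]))
        for a, b in combinations(tags_by_city, 2)
    ]
    if skip_zero:
        # Comparable to using inner joins: drop pairs without a common tag.
        rows = [row for row in rows if row[2] > 0]
    return rows


for row in match_counts(tags):
    print(row)
```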

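【Comments】:

It will return duplicates, as well as the name of the city being matched.
@dianuj I added a comment in the query that addresses the duplicates (change != into >). You misunderstood: city names are not matched.

【Answer 5】:

Could this be a push in the right direction?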

SELECT c.name, (
        -- count the tags of this city that the original city also has
        SELECT COUNT(*)
        FROM cities_tags AS ct
        WHERE ct.city_id = c.id
          AND ct.tag_id IN (
                -- tag ids of the original city (id 1 is assumed here, as in the other answers)
                SELECT src.tag_id
                FROM cities_tags AS src
                WHERE src.city_id = 1
              )
       ) AS matchCount
FROM cities AS c
WHERE c.id != 1
HAVING matchCount > 0

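What I tried was something like this:

// Find the city names:
Get cities.name (SUBQUERY) as matchCount FROM cities WHERE matchCount >0

// Subquery:
Select the number of tags the city has that (SUBSUBQUERY) also has

// Sub-subquery:
Select the ids of the tags the original city has

【Comments】: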