Finding similar entities using Python

【Question title】: Finding similar entities using python
【Posted】: 2020-11-29 09:01:19
【Question description】:

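I am trying to find a way to measure similarity between objects, in this case stores. Suppose we have a list of 5 stores. For each of them we have the following monthly metrics: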

Month - the month in question (Jan - Dec)
TotalSales - total sales
NumCustomers - number of customers who bought at the store
AvgUnitPrice - the average price they paid per item

A sample of the dataset looks like this:

Store   Month   TotalSales  NumCustomers    AvgUnitPrice
  1      Jan        100          10              5.00
  2      Jun        150          12              4.70
  3      Mar        200          20              4.95
  4      Apr        100          13              3.80
  5      Dec        300          25              4.36

I have a sixth store with the same variables (TotalSales, NumCustomers and AvgUnitPrice).

Based on the metrics above, how can I quantify how similar each store (1 - 5) is to store 6?

I have thought of two approaches, but I don't yet know how to implement them.

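Approach 1: Use a function that calculates the Pearson correlation. Example output (Store 1 - Store 6 = 86%)
Approach 2: Use a model that calculates distance (e.g. KNN) to determine which stores are "nearest".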

Any guidance on this would be greatly appreciated. Peace :)

【Comments】:

One solution is to store your data in a pandas DataFrame (which you may already be doing), then use the pandas.DataFrame.corrwith() method for approach 1 and one of the sklearn.neighbors estimators for approach 2.
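A minimal sketch of what that comment suggests, under stated assumptions: the question never gives numbers for store 6, so the values below are hypothetical, and the use of `StandardScaler` to put the three metrics on a common scale is an added choice rather than something the comment specifies. Note that with only three features per store, each Pearson coefficient is computed from three points and will be noisy.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

features = ["TotalSales", "NumCustomers", "AvgUnitPrice"]

# Stores 1-5, taken from the sample in the question.
stores = pd.DataFrame(
    {
        "TotalSales": [100, 150, 200, 100, 300],
        "NumCustomers": [10, 12, 20, 13, 25],
        "AvgUnitPrice": [5.00, 4.70, 4.95, 3.80, 4.36],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Store"),
)

# HYPOTHETICAL values for store 6 (not given in the question).
store6 = pd.DataFrame(
    {"TotalSales": [180], "NumCustomers": [18], "AvgUnitPrice": [4.90]},
    index=pd.Index([6], name="Store"),
)

# Standardize each metric using stores 1-5 so no single metric dominates.
scaler = StandardScaler().fit(stores[features])
Z = pd.DataFrame(scaler.transform(stores[features]), index=stores.index, columns=features)
z6 = pd.Series(scaler.transform(store6[features])[0], index=features)

# Approach 1: Pearson correlation of each store's feature vector with store 6.
# After transposing, each column is one store, so corrwith() correlates
# every store column with z6 (aligned on the feature names).
pearson = Z.T.corrwith(z6)

# Approach 2: Euclidean distance from store 6 to every other store,
# returned in ascending order (nearest first).
nn = NearestNeighbors(n_neighbors=len(Z), metric="euclidean").fit(Z.to_numpy())
dist, idx = nn.kneighbors(z6.to_numpy().reshape(1, -1))
nearest = pd.Series(dist[0], index=Z.index[idx[0]], name="distance")

print(pearson)
print(nearest)
```

With a full monthly history per store instead of a single row, the same `corrwith()` call could be applied to longer series (for example TotalSales per month), which would likely give a more meaningful correlation.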

Tags: python pandas knn

【Solution 1】:

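A simple version is vector cosine similarity. sklearn includes an implementation, so something like this (convert the month to a numeric value, then normalize the features before computing the similarity):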

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

df_dict = {'Store': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 'Month': {0: 'Jan', 1: 'Jun', 2: 'Mar', 3: 'Apr', 4: 'Dec'},
 'TotalSales': {0: 100, 1: 150, 2: 200, 3: 100, 4: 300},
 'NumCustomers': {0: 10, 1: 12, 2: 20, 3: 13, 4: 25},
 'AvgUnitPrice': {0: 5.0, 1: 4.7, 2: 4.95, 3: 3.8, 4: 4.36}}

# map month abbreviations to month numbers
d = {"Jan":1, "Feb":2, "Mar":3, "Apr":4, "May":5, "Jun":6, "Jul":7, "Aug":8, "Sep":9, "Oct":10, "Nov":11, "Dec":12}

df = pd.DataFrame.from_dict(df_dict)
# generate periodic month feature - nearby months more similar
df["Month"] = np.sin(df["Month"].map(d)/12*2*np.pi)
X = df.drop(columns="Store")
# scale each feature column to unit L2 norm
X = pd.DataFrame(normalize(X, axis=0), columns=X.columns)

# pairwise cosine similarity between all stores
m_cos = cosine_similarity(X, X)

df_cos = pd.DataFrame(m_cos, columns=df["Store"], index=df["Store"])
print(df_cos)

Output:

Store         1         2         3         4         5
Store
1      1.000000  0.847529  0.948407  0.939495  0.743462
2      0.847529  1.000000  0.759483  0.663606  0.938521
3      0.948407  0.759483  1.000000  0.982677  0.757679
4      0.939495  0.663606  0.982677  1.000000  0.630423
5      0.743462  0.938521  0.757679  0.630423  1.000000
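The matrix above compares stores 1 to 5 with each other, while the question asks about store 6. The following sketch applies the same idea to that case. It rests on two assumptions that are not in the answer: the store 6 row is hypothetical (the question gives no values for it), and the month is encoded with both a sine and a cosine component instead of the sine alone, because sin(2π·6/12) and sin(2π·12/12) are both approximately 0, which would make June and December look identical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

df = pd.DataFrame(
    {
        "Store": [1, 2, 3, 4, 5, 6],
        # The last entry in every column is a HYPOTHETICAL store 6.
        "Month": ["Jan", "Jun", "Mar", "Apr", "Dec", "Feb"],
        "TotalSales": [100, 150, 200, 100, 300, 180],
        "NumCustomers": [10, 12, 20, 13, 25, 18],
        "AvgUnitPrice": [5.00, 4.70, 4.95, 3.80, 4.36, 4.90],
    }
)

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_num = df["Month"].map({m: i for i, m in enumerate(months, start=1)})
angle = 2 * np.pi * month_num / 12

X = df[["TotalSales", "NumCustomers", "AvgUnitPrice"]].astype(float)
# Place each month on the unit circle so that Dec and Jan are neighbours
# and Jun and Dec end up on opposite sides.
X["MonthSin"] = np.sin(angle)
X["MonthCos"] = np.cos(angle)

# Scale each column to unit L2 norm, as the answer does.
X_norm = normalize(X, axis=0)

# Cosine similarity of stores 1-5 (rows) against store 6 (single column).
is_target = (df["Store"] == 6).to_numpy()
sims = cosine_similarity(X_norm[~is_target], X_norm[is_target]).ravel()

similarity_to_6 = pd.Series(
    sims, index=df.loc[~is_target, "Store"], name="cosine_similarity_to_store_6"
).sort_values(ascending=False)
print(similarity_to_6)
```

The resulting values lie between -1 and 1, with higher values meaning more similar, so the first entry of `similarity_to_6` is the store closest to store 6 under these assumptions.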

【Discussion】: