Cosine similarity of rows in a pandas DataFrame

【Question title】: Cosine similarity of rows in pandas DataFrame
【Posted】: 2018-06-16 10:02:46
【Question】:

I computed the cosine similarity of a DataFrame similar to the following:

ciiu4n4  A0111  A0112  A0113   
 A0111      14      7      6 
 A0112      16     55      3 
 A0113      15      0    112

using this code:

from sklearn.metrics.pairwise import cosine_similarity

data_cosine = mpg_data.drop(['ciiu4n4'], axis=1)
result = cosine_similarity(data_cosine)

I get an array like this:

[[ 1.          0.95357118  0.95814892 ]
 [ 0.95357118  1.          0.89993795 ]
 [ 0.95814892  0.89993795  1.         ]]

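However, I need the result as a DataFrame laid out like the original one. I can't do this by hand, because the original DataFrame is 600 x 600.

The result I need looks like this: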

ciiu4n4   A0111        A0112        A0113       
 A0111    1.           0.95357118   0.95814892
 A0112    0.95357118   1.           0.89993795
 A0113    0.95814892   0.89993795   1.

【Question comments】:

Tags: python pandas dataframe cosine-similarity

【Solution 1】:

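I suggest changing your approach slightly. There is no need to drop any columns. Instead, set the first column as the index, compute the cosine similarity, and assign the resulting array back to the DataFrame.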

df = df.set_index('ciiu4n4')
df

         A0111  A0112  A0113
ciiu4n4                     
A0111       14      7      6
A0112       16     55      3
A0113       15      0    112

v = cosine_similarity(df.values)

df[:] = v
df.reset_index()

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000

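The solution above works only when the number of rows equals the number of columns (not counting the first column), because the similarity matrix always has one row and one column per input row. So here is another solution that should work in any scenario.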

df = df.set_index('ciiu4n4')
v = cosine_similarity(df.values)

df = pd.DataFrame(v, columns=df.index.values, index=df.index).reset_index()
df

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000
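
For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. It assumes the sample values shown in the question, and it uses a small NumPy helper (`row_cosine_similarity`, a hypothetical name) as a stand-in for scikit-learn's `cosine_similarity`, so the snippet needs only pandas and NumPy. The numbers it prints come from the sample values, not from the asker's real data.

```python
import numpy as np
import pandas as pd


def row_cosine_similarity(values):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    values = np.asarray(values, dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / norms  # assumes no row is all zeros
    return unit @ unit.T


# Sample values from the question.
df = pd.DataFrame({
    "ciiu4n4": ["A0111", "A0112", "A0113"],
    "A0111": [14, 16, 15],
    "A0112": [7, 55, 0],
    "A0113": [6, 3, 112],
})

labels = df.set_index("ciiu4n4")
v = row_cosine_similarity(labels.to_numpy())

# Both axes of the result correspond to the original rows,
# so the row labels are used for the index and for the columns.
sim = pd.DataFrame(v, index=labels.index, columns=labels.index.to_numpy())
sim = sim.reset_index()
print(sim)
```

Building a new DataFrame from the row labels, rather than writing into the old one, is what makes this independent of how many feature columns the input has.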

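Alternatively, use df.insert (keep the index labels in a variable first, since the new DataFrame gets a default integer index):

labels = df.index
df = pd.DataFrame(v, columns=labels.values)
df.insert(0, 'ciiu4n4', labels)
df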

  ciiu4n4     A0111     A0112     A0113
0   A0111  1.000000  0.953571  0.958149
1   A0112  0.953571  1.000000  0.899938
2   A0113  0.958149  0.899938  1.000000

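【Comments】:

Thanks @COLDSPEED. I'm now getting an error on df[:] = v: "ValueError: Must have equal len keys and value when setting with an ndarray"
@PAstudilloE The idea is that every column that does not take part in the calculation must be set as the index. So please do that.
@COLDSPEED, the only column that does not take part in the calculation is ciiu4n4, and it is now set as the index. But I still get the same error. =(
@PAstudilloE Could you print df.shape and v.shape...?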
@COLDSPEED df.shape (390, 414), v.shape (390,390)
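
The shapes in the last comment point to the cause of the error: for an input with n rows and m columns, the row-by-row similarity matrix is n x n, so `df[:] = v` can only line up when m equals n. Here that is 390 x 414 against 390 x 390. A minimal sketch with a hypothetical 3 x 5 frame (made-up values and labels, and a NumPy stand-in for `cosine_similarity`) reproduces the mismatch and applies the constructor-based approach from the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical non-square data: 3 rows, 5 feature columns.
wide = pd.DataFrame(
    np.arange(1, 16).reshape(3, 5),
    index=pd.Index(["r1", "r2", "r3"], name="ciiu4n4"),
    columns=["c1", "c2", "c3", "c4", "c5"],
)

# NumPy stand-in for cosine_similarity: normalize rows, then take dot products.
x = wide.to_numpy(dtype=float)
unit = x / np.linalg.norm(x, axis=1, keepdims=True)
v = unit @ unit.T

print(wide.shape, v.shape)  # the shapes differ: (3, 5) vs (3, 3)

# A new frame built from the row labels works regardless of column count.
sim = pd.DataFrame(v, index=wide.index, columns=wide.index.to_numpy())
print(sim.reset_index())
```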