[Posted]: 2019-01-20 20:32:24
[Question]:
I ran the following example:
Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.
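It works fine with EuclideanDistanceMeasure. But I'm not sure why the distance measures intended for text (TanimotoDistanceMeasure and CosineDistanceMeasure) give me only one cluster.
Why is that? I won't pretend I have no idea why these two distance measures might give unsatisfying results, but what should I change? There are so many numeric parameters that I can't work out what each one affects. I do have the book Mahout in Action, though I've only read two chapters.
EuclideanDistanceMeasure (2 clusters - good)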
Clusters:
7 -> wt: 1.0 distance: 4.4960791719810365 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
7 -> wt: 1.0 distance: 4.496079376645949 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
7 -> wt: 1.0 distance: 4.496079576525459 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
9 -> wt: 1.0 distance: 4.389955960700927 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
9 -> wt: 1.0 distance: 4.389956011306051 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
9 -> wt: 1.0 distance: 4.3899560687101395 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
9 -> wt: 1.0 distance: 4.389956137136399 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
7 -> wt: 1.0 distance: 5.577549042707083 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
9 -> wt: 1.0 distance: 4.389956708176695 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
9 -> wt: 1.0 distance: 4.389471924190491 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new EuclideanDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
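To get a feel for the scale of the numbers in this run, here is a minimal sketch in plain Java (it does not use the Mahout API) that computes the Euclidean distance between a few of the vectors printed above. The class and method names are made up for illustration. Only the vector values come from the output.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EuclideanSketch {

    // Vectors copied from the cluster output above (term index -> weight).
    static final Map<Integer, Double> DOC_1 = Map.of(8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609);
    static final Map<Integer, Double> DOC_10 = Map.of(2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693);
    static final Map<Integer, Double> DOC_3 = Map.of(4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609);

    // Euclidean distance between two sparse vectors; a missing index counts as 0.
    static double euclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> indices = new HashSet<>(a.keySet());
        indices.addAll(b.keySet());
        double sum = 0.0;
        for (int i : indices) {
            double diff = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Both "red" documents share only term 29.
        System.out.printf("Document 1 vs Document 10: %.3f%n", euclidean(DOC_1, DOC_10));
        // A "red" and a "blue" document share no terms at all.
        System.out.printf("Document 1 vs Document 3:  %.3f%n", euclidean(DOC_1, DOC_3));
    }
}
```

The two distances come out at roughly 6.39 and 6.75. If the 20 and 5 in the CanopyDriver.run call are the canopy thresholds T1 and T2 (an assumption about the argument order), both values fall between the two thresholds.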
CosineDistanceMeasure (only 1 cluster - bad)
Clusters:
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.6362357041216559 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.636235704121656 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.5876411474816594 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.6328896123664868 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new CosineDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
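For comparison, here is a minimal plain-Java sketch (again not the Mahout API) of the usual cosine distance, 1 - cos(a, b), applied to the same printed vectors. It assumes CosineDistanceMeasure follows that definition. The class and method names are hypothetical.

```java
import java.util.Map;

public class CosineSketch {

    // Vectors copied from the cluster output above (term index -> weight).
    static final Map<Integer, Double> DOC_1 = Map.of(8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609);
    static final Map<Integer, Double> DOC_10 = Map.of(2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693);
    static final Map<Integer, Double> DOC_3 = Map.of(4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609);

    // Dot product of two sparse vectors; only shared indices contribute.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Cosine distance: 1 - (a . b) / (|a| * |b|).
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return 1.0 - dot(a, b) / norms;
    }

    public static void main(String[] args) {
        System.out.printf("Document 1 vs Document 10: %.4f%n", cosineDistance(DOC_1, DOC_10));
        System.out.printf("Document 1 vs Document 3:  %.4f%n", cosineDistance(DOC_1, DOC_3));
    }
}
```

With non-negative term weights this distance cannot exceed 1: the two documents that share one term land at about 0.877, and the pair with no shared terms lands at 1.0. Every distance is therefore well below both 20 and 5.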
TanimotoDistanceMeasure (only 1 cluster - bad)
Clusters:
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8637279689324617 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
0 -> wt: 1.0 distance: 0.8723755210900389 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
0 -> wt: 1.0 distance: 0.8596377086023765 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
Produced by:
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new TanimotoDistanceMeasure(), 20, 5,
true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
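One way to explore how the thresholds interact with a bounded measure is a simplified canopy pass over the ten printed vectors. The sketch below is plain Java, not Mahout's implementation, and it rests on several assumptions: that the 20 and 5 above are the T1 and T2 thresholds, that Tanimoto distance is 1 - a.b / (|a|^2 + |b|^2 - a.b), and that a textbook canopy pass is close enough to what CanopyDriver does. The value 0.99 is a hypothetical threshold chosen only for contrast, not a recommended setting.

```java
import java.util.ArrayList;
import java.util.List;

public class TanimotoCanopySketch {

    static final int DIMENSIONS = 34; // term indices 0..33 appear in the output above

    // Builds a dense vector from alternating (index, weight) pairs.
    static double[] vec(double... pairs) {
        double[] v = new double[DIMENSIONS];
        for (int i = 0; i < pairs.length; i += 2) {
            v[(int) pairs[i]] = pairs[i + 1];
        }
        return v;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b).
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    // Simplified canopy pass: the first remaining point becomes a center, and
    // every remaining point closer than t2 is dropped from the candidate list.
    // T1 is left out because it affects membership, not the number of centers.
    static int countCanopies(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            canopies++;
            remaining.removeIf(p -> tanimoto(center, p) < t2);
        }
        return canopies;
    }

    // Documents 1..10, copied from the cluster output above.
    static List<double[]> documents() {
        return List.of(
            vec(8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609),
            vec(3, 2.609, 16, 2.609, 25, 2.609, 29, 1.693),
            vec(4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609),
            vec(4, 1.357, 5, 2.609, 7, 2.609, 26, 2.609),
            vec(0, 2.609, 4, 1.357, 11, 2.609, 32, 2.609),
            vec(4, 1.357, 17, 2.609, 22, 2.609, 24, 2.609),
            vec(1, 2.609, 12, 2.609, 14, 2.609, 19, 2.609, 29, 1.693, 33, 2.609),
            vec(4, 1.357, 6, 2.609, 28, 2.609, 31, 2.609),
            vec(4, 1.357, 15, 2.609, 20, 2.609, 23, 2.609),
            vec(2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693));
    }

    public static void main(String[] args) {
        System.out.println("T2 = 5.0  -> canopies: " + countCanopies(documents(), 5.0));
        System.out.println("T2 = 0.99 -> canopies: " + countCanopies(documents(), 0.99));
    }
}
```

In this simplified pass, T2 = 5 leaves a single canopy because no Tanimoto distance between these vectors exceeds 1, while the illustrative T2 = 0.99 splits the documents into a "red" group and a "blue" group. Whether Mahout behaves the same way on this data would need to be checked against the actual CanopyDriver run.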
[Comments]:
-
In my opinion, 1 cluster is the better result on that toy data.
-
I think this would happen even on my real data. Any suggestions for an 11th toy document that would produce a second cluster?
-
For all of these measures, you need longer documents for them to work well.
-
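Oh. I've been trying some larger documents (1-3 paragraphs of non-prose text), but still got only one cluster. Thanks for the feedback, though. I'll experiment with the data set and try to pin down cause and effect.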
-
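@Anony-Mousse - thank you very much for your support. It turns out your first comment was right. If you copy and paste my answer as your own post, I'll mark it as the accepted answer so that you get the credit.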