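Since Spark 2.0, Vectors.sqdist can be used to compute the squared distance between two vectors.
You can use a UDF to compute the distance from each point to the center of its cluster, as follows: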
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf
// toDF and the $"..." column syntax need spark.implicits._ (already imported in spark-shell)
// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))
val df = points.map(Tuple1.apply).toDF("features")
// K-means
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val kmeansModel = kmeans.fit(df)
val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)
// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/
// UDF that computes, for each point, the squared distance to its assigned cluster center
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, kmeansModel.clusterCenters(c)))
val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features  |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1         |0.3125            |
|[2.0,-3.0]|0         |0.625             |
|[0.5,-1.0]|1         |0.3125            |
|[1.5,-1.5]|0         |0.625             |
+----------+----------+------------------+
*/
Note: Vectors.sqdist computes the squared distance between two vectors (no square root is taken). If you need the Euclidean distance, you can use Math.sqrt(Vectors.sqdist(...)).
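To make the difference between the two quantities concrete, here is a minimal plain-Scala sketch that reproduces the first row of the table by hand. The helper squaredDistance is hypothetical and is only a stand-in for what Vectors.sqdist computes (the sum of squared coordinate differences); it is not part of Spark.

```scala
// Hypothetical helper, not a Spark API: sum of squared coordinate differences,
// i.e. the quantity that Vectors.sqdist returns for two vectors.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

// First sample point and the center of the cluster it was assigned to
val point = Array(1.0, 0.0)
val center = Array(0.75, -0.5)

val squared = squaredDistance(point, center) // 0.3125, matches the table
val euclidean = math.sqrt(squared)           // roughly 0.559

println(s"squared = $squared, euclidean = $euclidean")
```

In the Spark example, the same effect should be obtained by wrapping the body of the UDF in a square root, as the note above describes.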