[Title]: kmeans pyspark org.apache.spark.SparkException: Job aborted due to stage failure
[Posted]: 2020-11-07 03:17:47
[Question]:

I want to run k-means on my dataset (6.7 million rows and 22 variables):

base.dtypes

 ('anonimisation2', 'double'),
 ('anonimisation3', 'double'),
 ('anonimisation4', 'double'),
 ('anonimisation5', 'double'),
 ('anonimisation6', 'double'),
 ('anonimisation7', 'double'),
 ('anonimisation8', 'double'),
 ('anonimisation9', 'double'),
 ('anonimisation10', 'double'),
 ('anonimisation11', 'double'),
 ('anonimisation12', 'double'),
 ('anonimisation13', 'double'),
 ('anonimisation14', 'double'),
 ('anonimisation15', 'double'),
 ('anonimisation16', 'double'),
 ('anonimisation17', 'double'),
 ('anonimisation18', 'double'),
 ('anonimisation19', 'double'),
 ('anonimisation20', 'double'),
 ('anonimisation21', 'double'),
 ('anonimisation22', 'double')]

I read that I should use this code:

def transData(base):
    # Note: r[:-1] drops the last column, so only 21 of the 22
    # variables end up in the feature vector.
    return base.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

transformed = transData(base)
transformed.show(5, False)

Then I wrote this:

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(transformed)

And I get this error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

I don't know what to do. If you need more information, please ask. Thanks.

I also tried doing this in Python with Pandas, but ran into problems there too.

[Comments]:

  • Please post the full stack trace here; without it, it is impossible to tell what the error actually is.

Tags: apache-spark pyspark k-means


[Solution 1]:

Use from pyspark.ml.linalg import Vectors instead of from pyspark.mllib.linalg import Vectors. The old pyspark.mllib Vectors produce a different VectorUDT than the one pyspark.ml estimators such as KMeans expect, which is why the error message lists two struct types that look identical.

[Comments]:
