【Question Title】: Spark: VectorAssembler throws error org.apache.spark.SparkException: Job aborted due to stage failure
【Posted】: 2020-10-04 13:22:39
【Question Description】:

I have a dataframe like the one below. The features are F1, F2, and F3, and the output variable is Output.

+-----+-----+-------+------+
|   F1|   F2|     F3|Output|
+-----+-----+-------+------+
|6.575| 4.98|   15.3|504000|
|6.421| 9.14|   17.8|453600|
|7.185| 4.03|   17.8|728700|
|6.998| 2.94|   18.7|701400|
|7.147| 5.33|   18.7|760200|
+-----+-----+-------+------+
For Apache Spark to run any ML algorithm, we need two columns: the features and the output label. The features column is a vector combining all the feature values. To build it, I use VectorAssembler.

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
data_schema = [StructField('F1', IntegerType(), True),
               StructField('F2', IntegerType(), True),
               StructField('F3', IntegerType(), True),
               StructField('Output', IntegerType(), True)]
final_struc = StructType(fields=data_schema)
training=spark.read.csv('housing.csv', schema=final_struc)
vectorAssembler = VectorAssembler(inputCols = ['F1', 'F2', 'F3'], outputCol = 'features')
vhouse_df = vectorAssembler.transform(training)
vhouse_df = vhouse_df.select(['features', 'Output'])

When I try to view vhouse_df, I get an error:

vhouse_df.show()

Py4JJavaError: An error occurred while calling o948.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, 10.0.2.15, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$2720/0x00000008410e0840: (struct<F1_double_VectorAssembler_becd63a80d0f:double,F2_double_VectorAssembler_becd63a80d0f:double,F3_double_VectorAssembler_becd63a80d0f:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)

【Comments】:

  • Can you check what data type each column is expected to be? This may be caused by a data type mismatch.
  • Also, a quick suggestion: change vhouse_df = vhouse_df.select(['features', 'OUTPUT']) to vhouse_df = vhouse_df.select('features', 'OUTPUT')
  • How do I check which data types to pass?
  • Can you check it now? The schema types have been updated.

Tags: apache-spark pyspark


【Solution 1】:

Looking at the schema you provided -

In the dataset, all input columns - F1, F2, and F3 - are doubles, so change IntegerType to DoubleType:

from pyspark.sql import types as T

data_schema = T.StructType([
    T.StructField('F1', T.DoubleType(), True),
    T.StructField('F2', T.DoubleType(), True),
    T.StructField('F3', T.DoubleType(), True),
    T.StructField('Output', T.IntegerType(), True)
])

You can also try something like this - it maps the schema directly from your df:

training=spark.read.json('housing.csv', schema=df.schema)

Also, a quick suggestion - in your code, change

vhouse_df = vhouse_df.select(['features', 'OUTPUT'])

to

vhouse_df = vhouse_df.select('features', 'OUTPUT')

------- Update -------

I'm not sure if this is what you're looking for - it works fine for me:

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler

df = spark.read.csv('/FileStore/tables/datasets_1379_2485_housing.csv', header="true", inferSchema="true")
vectorAssembler = VectorAssembler(inputCols = ['RM', 'LSTAT', 'PTRATIO'], outputCol = 'features')
vhouse_df = vectorAssembler.transform(df)
vhouse_df.show()

------- Output -------

+-----+-----+-------+--------+------------------+
|   RM|LSTAT|PTRATIO|    MEDV|          features|
+-----+-----+-------+--------+------------------+
|6.575| 4.98|   15.3|504000.0| [6.575,4.98,15.3]|
|6.421| 9.14|   17.8|453600.0| [6.421,9.14,17.8]|
|7.185| 4.03|   17.8|728700.0| [7.185,4.03,17.8]|
|6.998| 2.94|   18.7|701400.0| [6.998,2.94,18.7]|
|7.147| 5.33|   18.7|760200.0| [7.147,5.33,18.7]|
| 6.43| 5.21|   18.7|602700.0|  [6.43,5.21,18.7]|
|6.012|12.43|   15.2|480900.0|[6.012,12.43,15.2]|
|6.172|19.15|   15.2|569100.0|[6.172,19.15,15.2]|
|5.631|29.93|   15.2|346500.0|[5.631,29.93,15.2]|
|6.004| 17.1|   15.2|396900.0| [6.004,17.1,15.2]|
|6.377|20.45|   15.2|315000.0|[6.377,20.45,15.2]|
|6.009|13.27|   15.2|396900.0|[6.009,13.27,15.2]|
|5.889|15.71|   15.2|455700.0|[5.889,15.71,15.2]|
|5.949| 8.26|   21.0|428400.0| [5.949,8.26,21.0]|
|6.096|10.26|   21.0|382200.0|[6.096,10.26,21.0]|
|5.834| 8.47|   21.0|417900.0| [5.834,8.47,21.0]|
|5.935| 6.58|   21.0|485100.0| [5.935,6.58,21.0]|
| 5.99|14.67|   21.0|367500.0| [5.99,14.67,21.0]|
|5.456|11.69|   21.0|424200.0|[5.456,11.69,21.0]|
|5.727|11.28|   21.0|382200.0|[5.727,11.28,21.0]|
+-----+-----+-------+--------+------------------+

【Discussion】:

  • I'm doing vhouse_df = vhouse_df.select('features', 'OUTPUT') now and still get the same error afterwards.
  • How do I change it to double?
  • Please check - the answer has been updated. Once it works, please don't forget to accept the answer. Thanks in advance.
  • Hi. I did exactly the same as above, changed to double and removed the square brackets. Still doesn't work... I upvoted for your effort but can't accept it as the answer since it doesn't solve my problem.
  • OK... it depends on the structure of your json - can you share what the json input looks like?