【Posted】:2020-10-04 13:22:39
【Question】:
I have a DataFrame like the one below. The features are F1, F2 and F3, and the output variable is Output.
+-----+-----+-----+------+
|   F1|   F2|   F3|Output|
+-----+-----+-----+------+
|6.575| 4.98| 15.3|504000|
|6.421| 9.14| 17.8|453600|
|7.185| 4.03| 17.8|728700|
|6.998| 2.94| 18.7|701400|
|7.147| 5.33| 18.7|760200|
+-----+-----+-----+------+
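To run any ML algorithm in Apache Spark, we need two columns: a features column and an output label column. The features column is a vector that combines all the feature values. To build it, I am using VectorAssembler.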
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
data_schema = [StructField('F1', IntegerType(), True),
               StructField('F2', IntegerType(), True),
               StructField('F3', IntegerType(), True),
               StructField('Output', IntegerType(), True)]
final_struc = StructType(fields=data_schema)
training = spark.read.csv('housing.csv', schema=final_struc)
vectorAssembler = VectorAssembler(inputCols=['F1', 'F2', 'F3'], outputCol='features')
vhouse_df = vectorAssembler.transform(training)
vhouse_df = vhouse_df.select(['features', 'Output'])
When I try to view vhouse_df, I get an error:
vhouse_df.show()
Py4JJavaError: An error occurred while calling o948.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, 10.0.2.15, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$2720/0x00000008410e0840: (struct<F1_double_VectorAssembler_becd63a80d0f:double,F2_double_VectorAssembler_becd63a80d0f:double,F3_double_VectorAssembler_becd63a80d0f:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
【Comments】:
-
Can you check which data types are expected to be passed in? This may be a data type issue.
-
Also, a quick suggestion: change vhouse_df = vhouse_df.select(['features', 'OUTPUT']) to vhouse_df = vhouse_df.select('features', 'OUTPUT')
-
How do I check which data types to pass?
-
Can you check now? The schema types have been updated.
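On the question of how to check which data types to pass: inside Spark, one common way is to read the file with inferSchema=True and call printSchema() on the result. As a quick check that needs no Spark session, here is a minimal plain-Python sketch that scans a few CSV rows and reports the narrowest type each column fits. The helper names infer_column_types and classify are hypothetical. The sample rows are copied from the table in the question, and the presence of a header row is an assumption, since the layout of housing.csv is not shown.

```python
import csv
import io


def classify(value):
    """Return 'int', 'float' or 'string' for a single CSV cell."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_column_types(csv_text, has_header=True):
    """Return {column name: narrowest type that fits every non-empty cell}."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # Assumption: fall back to positional names when there is no header.
        header = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows

    rank = {"int": 0, "float": 1, "string": 2}
    result = {name: "int" for name in header}
    for row in data:
        for name, value in zip(header, row):
            value = value.strip()
            if value == "":
                continue  # treat empty cells as missing, not as strings
            kind = classify(value)
            # Widen the column type when a cell does not fit the current one.
            if rank[kind] > rank[result[name]]:
                result[name] = kind
    return result


# Sample rows taken from the table in the question (header row assumed).
sample = """F1,F2,F3,Output
6.575,4.98,15.3,504000
6.421,9.14,17.8,453600
7.185,4.03,17.8,728700
"""

types = infer_column_types(sample)
print(types)
# → {'F1': 'float', 'F2': 'float', 'F3': 'float', 'Output': 'int'}
```

For the rows shown, F1, F2 and F3 come out as float, so declaring those fields as DoubleType() rather than IntegerType() in the StructType is a likely fix. Values that cannot be parsed under the declared type may be read as null, and VectorAssembler raises an error on null inputs under its default handleInvalid='error' setting.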
Tags: apache-spark pyspark