【Posted】: 2021-08-28 14:06:46
【Problem description】:
I am trying to convert a pandas DataFrame to a PySpark DataFrame:
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, FloatType
)

mySchema = StructType([
    StructField("movieId", IntegerType()),
    StructField("title", StringType()),
    StructField("userId", IntegerType()),
    StructField("rating", FloatType()),
])

movielens = spark.createDataFrame(merged_df, mySchema)
movielens.printSchema()
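One thing worth checking before the conversion is whether the pandas columns already carry the types the schema declares; columns read from CSV without dtype hints often come back as strings. A minimal sketch of casting on the pandas side (this `merged_df` is a hypothetical stand-in for the question's merged frame):

```python
import pandas as pd

# Hypothetical stand-in for the question's merged_df: every column is a
# string (object dtype), as often happens after reading CSVs without dtypes.
merged_df = pd.DataFrame({
    "movieId": ["1", "2"],
    "title": ["Toy Story", "Jumanji"],
    "userId": ["10", "20"],
    "rating": ["4.0", "3.5"],
})

# Cast on the pandas side so the data actually matches the declared Spark schema
merged_df = merged_df.astype({"movieId": "int32", "userId": "int32", "rating": "float32"})

print(merged_df.dtypes["userId"])  # int32
```

If the dtypes printed here disagree with `mySchema`, the mismatch exists before Spark ever sees the data.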
The schema:
root
|-- movieId: integer (nullable = true)
|-- title: string (nullable = true)
|-- userId: integer (nullable = true)
|-- rating: float (nullable = true)
Then I prepare my data for the model:
train_set, temp = movielens.randomSplit([8.0, 1.0], seed=1)
validation_set = (temp.join(train_set, ["userId"], "left_semi")
                      .join(train_set, ["movieId"], "left_semi"))
removed = temp.join(validation_set, ["movieId", "userId"], "left_anti")
train_set = train_set.union(removed)
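The two semi-joins above keep a validation row only when both its user and its movie also appear in the training split (so ALS never has to score an unseen id), and the anti-join routes the dropped rows back into training. A toy illustration of the same filtering in plain Python:

```python
# (userId, movieId) pairs standing in for the training and candidate splits
train = [(10, 1), (10, 2), (20, 1)]
temp  = [(10, 2), (30, 1), (20, 3)]

train_users  = {u for u, _ in train}   # users seen in training
train_movies = {m for _, m in train}   # movies seen in training

# left_semi on userId, then left_semi on movieId: keep rows where both match
validation = [(u, m) for u, m in temp if u in train_users and m in train_movies]

# left_anti: the candidate rows that did NOT survive the filtering
removed = [row for row in temp if row not in validation]

print(validation)  # [(10, 2)]
print(removed)     # [(30, 1), (20, 3)]
```

The surviving rows form the validation set; the removed rows are unioned back into training so no ratings are lost.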
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
)
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction",
)
model = als.fit(train_set)
predictions = model.transform(validation_set)
Then I get this error:
IllegalArgumentException: requirement failed: Column userId must be of type numeric but was actually of type string.
How is this possible, given that I wrote the types manually in mySchema? Any help would be appreciated.
【Comments】:
- Have you tried train_set.printSchema() to confirm the data types?
- Nothing changes. The column types stay the same.
Tags: apache-spark pyspark