【Posted】: 2020-05-03 11:08:41
【Problem Description】:
I have a job distributed across workers, and each worker outputs a dataframe that needs to be written to Hive. I can't figure out how to access Hive from the workers without initializing another SparkContext, so I tried collecting their outputs and inserting them all at once, like this:
result = df.rdd.map(lambda rdd: predict_item_by_model(rdd, columns)).collect()
df_list = sc.parallelize(result).map(lambda df: hiveContext.createDataFrame(df)).collect()  # throws error
mergedDF = reduce(DataFrame.union, df_list)
mergedDF.write.mode('overwrite').partitionBy("item_id").saveAsTable("items")
But now it throws this error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is it possible to access Hive directly from the workers? If not, how can I collect the results and insert them all in one go?
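The SPARK-5063 error comes from referencing `hiveContext` inside a `map` closure; contexts and DataFrames can only be used on the driver. One common workaround is to have `predict_item_by_model` return plain Python objects (e.g. lists of row dicts), collect those to the driver, and build a single DataFrame there. A minimal sketch of the driver-side flattening step, assuming each worker call returns a list of row dicts (the pyspark calls are shown only as comments, since they need a live session):

```python
from itertools import chain

def flatten_predictions(result):
    """Flatten the per-record lists of row dicts returned by collect()
    into one flat list suitable for a single createDataFrame call."""
    return list(chain.from_iterable(result))

# On the driver (sketch, not executed here):
# result = df.rdd.map(lambda r: predict_item_by_model(r, columns)).collect()
# rows = flatten_predictions(result)
# mergedDF = hiveContext.createDataFrame(rows)
# mergedDF.write.mode('overwrite').partitionBy("item_id").saveAsTable("items")
```

Better still, if the outputs fit the same schema, you can avoid `collect()` entirely and keep everything distributed with `df.rdd.flatMap(lambda r: predict_item_by_model(r, columns)).toDF()`, then write that DataFrame once.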
【Discussion】:
Tags: apache-spark pyspark hive