[Posted]: 2017-05-30 22:47:48
[Problem description]:
How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them back automatically. Writing metadata directly through the PySpark API is not possible unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, without converting the dataset to an RDD and back, given a full schema description (as described here)?
Example listing:
# Create DF
from pyspark.ml.feature import VectorSlicer, IndexToString
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
df.show()
# +---+-------------+
# | id| features|
# +---+-------------+
# | 0|[1.0,1.0,4.0]|
# | 1|[2.0,2.0,4.0]|
# +---+-------------+
# - This one carries all the necessary metadata about what is encoded in the features column
# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# | 0| [1.0]|
# | 1| [2.0]|
# +---+-------------+
# categoryIndex now carries metadata describing the encoding, but as a single-element vector
# Get rid of the single-element vector
udf = UserDefinedFunction(lambda x: float(x[0]), returnType=DoubleType())
df2 = df.select(*[udf(column).alias(column) if column == 'categoryIndex' else column for column in df.columns])
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# | 0| 1.0|
# | 1| 2.0|
# +---+-------------+
# - Metadata is lost for that one
# Write metadata
extract = {...}
df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
# metadata is readable from df2.schema.fields[1].metadata but has no effect.
# Saving df to parquet and restoring it destroys the change
# Decode categorical
df = IndexToString(inputCol="categoryIndex", outputCol="category").transform(df)
# ERROR. Was supposed to decode the categorical values
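For reference, a possible workaround (an assumption on my part, not verified against every Spark version): since Spark 2.2, `Column.alias` accepts a `metadata` keyword argument, so the metadata extracted from the sliced column could be reattached without an RDD round-trip. The `with_scalar_metadata` helper below is hypothetical, and the metadata dict layout it expects is an assumption about what `VectorSlicer` produces.

```python
# Hedged sketch: convert the single-attribute vector metadata that
# VectorSlicer leaves on 'categoryIndex' into scalar nominal metadata
# that IndexToString can read. The dict layout below is an assumption.

def with_scalar_metadata(vector_meta):
    """vector_meta: the dict found in df.schema['categoryIndex'].metadata
    for a one-element sliced vector. Returns metadata for a plain
    double column carrying the same nominal attribute."""
    attr = vector_meta['ml_attr']['attrs']['nominal'][0]
    return {'ml_attr': {'vals': attr['vals'],
                        'type': 'nominal',
                        'name': 'categoryIndex'}}

# Reattaching the metadata (assumes Spark >= 2.2, where Column.alias
# takes a `metadata` keyword argument):
#
#   from pyspark.sql.functions import col
#   meta = with_scalar_metadata(df.schema['categoryIndex'].metadata)
#   df2 = df2.withColumn('categoryIndex',
#                        col('categoryIndex').alias('categoryIndex',
#                                                   metadata=meta))
#
# IndexToString should then find its labels in df2's column metadata.
```

The pure-Python helper keeps the metadata transformation testable outside Spark; only the commented `alias` call touches the PySpark API.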
This question provides insight into how to work with VectorAssembler and VectorIndexer, and how to add metadata by constructing a full schema with StructType, but it does not answer my question.
[Discussion]:
Tags: apache-spark pyspark metadata apache-spark-ml