[Posted]: 2017-05-30 22:47:48
[Problem description]:
How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them back automatically. Writing metadata directly through the PySpark API is not possible unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, without converting the dataset to an RDD and back, given a full schema description (as described here)?
Example listing:
# Create DF
from pyspark.ml.feature import VectorSlicer, IndexToString
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
df.show()
# +---+-------------+
# | id| features|
# +---+-------------+
# | 0|[1.0,1.0,4.0]|
# | 1|[2.0,2.0,4.0]|
# +---+-------------+
# - This one carries all the necessary metadata about what is encoded in the features column
# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# | 0| [1.0]|
# | 1| [2.0]|
# +---+-------------+
# categoryIndex now carries metadata describing the encoding, but as a single-element vector
# Get rid of the single-element vector
udf = UserDefinedFunction(lambda x: float(x[0]), returnType=DoubleType())
df2 = df.select(*[udf(column).alias(column) if column == 'categoryIndex' else column for column in df.columns])
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# | 0| 1.0|
# | 1| 2.0|
# +---+-------------+
# - Metadata is lost for that one
# Write metadata
extract = {...}
df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
# metadata is readable from df2.schema.fields[1].metadata but has no effect.
# Saving df to parquet and restoring it destroys the change
# Decode categorical
df = IndexToString(inputCol="categoryIndex", outputCol="category").transform(df)
# ERROR. Was supposed to decode the categorical values
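For reference, a possible workaround (an assumption on my part, not verified against every Spark version): since Spark 2.2, `Column.alias` accepts a `metadata` keyword argument, so the metadata extracted from the sliced column could be reattached without an RDD round-trip. The `with_scalar_metadata` helper below is hypothetical, and the metadata dict layout it expects is an assumption about what `VectorSlicer` produces.

```python
# Hedged sketch: convert the single-attribute vector metadata that
# VectorSlicer leaves on 'categoryIndex' into scalar nominal metadata
# that IndexToString can read. The dict layout below is an assumption.

def with_scalar_metadata(vector_meta):
    """vector_meta: the dict found in df.schema['categoryIndex'].metadata
    for a one-element sliced vector. Returns metadata for a plain
    double column carrying the same nominal attribute."""
    attr = vector_meta['ml_attr']['attrs']['nominal'][0]
    return {'ml_attr': {'vals': attr['vals'],
                        'type': 'nominal',
                        'name': 'categoryIndex'}}

# Reattaching the metadata (assumes Spark >= 2.2, where Column.alias
# takes a `metadata` keyword argument):
#
#   from pyspark.sql.functions import col
#   meta = with_scalar_metadata(df.schema['categoryIndex'].metadata)
#   df2 = df2.withColumn('categoryIndex',
#                        col('categoryIndex').alias('categoryIndex',
#                                                   metadata=meta))
#
# IndexToString should then find its labels in df2's column metadata.
```

The pure-Python helper keeps the metadata transformation testable outside Spark; only the commented `alias` call touches the PySpark API.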
This question provides insight into how to work with VectorAssembler and VectorIndexer, and how to add metadata by constructing a full schema with StructType, but it does not answer my question.
[Discussion]:
Tags: apache-spark pyspark metadata apache-spark-ml