Adding a numeric column to a pyspark DataFrame based on the string value of a column: answers

【Question title】: Adding numeric column to pyspark DataFrame based on string value of column
【Posted】: 2016-02-15 17:12:40
【Question】:

I have a DataFrame built from a JSON file:

{ "1": "a b c d e f", "2": 1, "type": "type1"}
{ "1": "a b c b c", "2": 2, "type": "type1"}
{"1": "d d a b c", "2": 3, "type": "type2"}
...

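I am designing a naive Bayes classifier, and this DataFrame is my training set: the classifier will use features extracted from field 1, and the class (label) is given by the field type.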

My problem is that I get this error when fitting the model:

pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column type must be of type DoubleType but was actually StringType.'

This means the label field must be numeric. To work around this, I tried to map the string values to numeric values through a dictionary:

grouped = df.groupBy(df.type).agg({'*': 'count'}).persist()
types = {row.type: grouped.collect().index(row) for row in grouped.collect()}

The idea was then to add a new column to the DataFrame whose numeric value corresponds to the row's string value:

df = df.withColumn('type_numeric', types[df.type])

This of course fails, so I am wondering whether anyone has a better idea or suggestion for how to achieve this.

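【Comments on the question】:

Hello, please state your question and show your code first, then the result you expect, and finally the error message.

Tags: python dataframe pyspark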

【Solution 1】:

I solved it by using StringIndexer on the DataFrame.

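from pyspark.ml.feature import StringIndexer
# Fit the indexer on the string column, then add the numeric (double) column
string_indexer = StringIndexer(inputCol='type', outputCol='type_numeric')
rescaled_data_numeric = string_indexer.fit(df).transform(df)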
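To make clear what kind of mapping the indexer produces, here is a plain-Python sketch that needs no Spark. It assumes StringIndexer's default ordering, where the most frequent label gets index 0.0. Breaking ties alphabetically is an assumption about recent Spark versions. The helper name `index_labels` is hypothetical and is not part of pyspark.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct string label to a float index, most frequent first.

    Hypothetical helper that mimics StringIndexer's default ordering
    (descending frequency); ties are broken alphabetically (an assumption).
    """
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically for a stable result.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The 'type' values of the three sample rows from the question.
types = ["type1", "type1", "type2"]
mapping = index_labels(types)
type_numeric = [mapping[t] for t in types]

print(mapping)       # {'type1': 0.0, 'type2': 1.0}
print(type_numeric)  # [0.0, 0.0, 1.0]
```

In Spark itself the new type_numeric column holds doubles, so it should be usable as the label column when fitting the classifier, for example by setting the estimator's labelCol parameter to 'type_numeric'.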

【Comments】: