【Posted】: 2017-11-02 23:22:51
【Problem description】:
I have an RDD of tuples; its first two rows look like this:
[[('n', 12.012457082117459), ('s', 0.79112758892014912)],
[('t', 3.6243409329763652),('vn', 3.6243409329763652),('n', 52.743253562212828),('v', 11.644347760553064)]]
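In each tuple, the first element (e.g. 'n', 's', 't') is the desired column name, and the second element (e.g. 12.012, 0.7911, ...) is the value for that column. However, not every column name appears in every list (row) of the RDD. For example, only 'n' and 's' appear in the first row, while 's' does not appear in the second row. I would therefore like to convert this RDD into a DataFrame in which any column that is missing from the original tuples gets the value 0. In other words, the first two rows would look something like this: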
n     s     t     vn    (other columns omitted)
12    0.79  0     0     ...
52    0     3.62  3.62  ...
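To make the target shape concrete, here is a minimal plain-Python sketch of the fill-with-zero step on the two sample rows, without Spark. The helper name `fill_missing` and the shortened column list are hypothetical and only for illustration; the real column list is longer.

```python
# Hypothetical helper (not part of PySpark): expand one row of
# (name, value) pairs into a dict that contains every wanted column.
def fill_missing(pairs, columns, default=0.0):
    seen = dict(pairs)  # e.g. {'n': 12.01..., 's': 0.79...}
    # Columns absent from this row fall back to the default value.
    return {col: seen.get(col, default) for col in columns}

# Shortened column list, assumed here only for illustration.
columns = ['n', 's', 't', 'vn', 'v']

rows = [
    [('n', 12.012457082117459), ('s', 0.79112758892014912)],
    [('t', 3.6243409329763652), ('vn', 3.6243409329763652),
     ('n', 52.743253562212828), ('v', 11.644347760553064)],
]

filled = [fill_missing(r, columns) for r in rows]
for f in filled:
    print(f)
```

Every resulting dict has the same keys in the same order, which is the property the DataFrame would need.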
I tried the following:
row = Row('l','eng','q','g','j','b','nt','z','n','d','f','i','k','s','vn','nz','v','nrt','tg','nrfg','t','ng','zg','a')
df = tup_sum_data.map(row).toDF()
The strings in Row() are the column names I want. But I get the following error:
TypeError Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968 try:
--> 969 return _infer_schema(obj)
970 except TypeError:
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_schema(row)
991 else:
--> 992 raise TypeError("Can not infer schema for type: %s" % type(row))
993
TypeError: Can not infer schema for type: <class 'numpy.float64'>
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968 try:
--> 969 return _infer_schema(obj)
970 except TypeError:
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
969 return _infer_schema(obj)
970 except TypeError:
--> 971 raise TypeError("not supported type: %s" % type(obj))
972
973
TypeError: not supported type: <class 'tuple'>
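Some lines of the traceback have been omitted. Can anyone help me figure out how to deal with this? Thanks!
Update: I converted the values from np.float64 to float, and the error went away. However, the DataFrame does not look the way I want; it looks like this: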
+--------------------+
| l|
+--------------------+
|[[n,12.0124570821...|
|[[t,3.62434093297...|
|[[a,0.44628710262...|
|[[n,16.7534769832...|
|[[n,17.6017774340...|
+--------------------+
only showing top 5 rows
So can anyone help me get a correctly formatted DataFrame? Thanks!
【Discussion】:
Tags: python pyspark apache-spark-sql rdd pyspark-sql
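One possible reading of the single-column output, which I have not verified against the Spark source: `map(row)` passes each whole list to `row` as one argument, so only the first field, `l`, receives a value. If that is right, each row would first have to be turned into one plain float per column, in column order. The sketch below does that part in plain Python; `to_ordered_values` is a hypothetical helper, and the Spark lines in the trailing comments are an untested assumption.

```python
# Column names copied from the Row(...) call in the attempt.
COLUMNS = ['l', 'eng', 'q', 'g', 'j', 'b', 'nt', 'z', 'n', 'd', 'f', 'i',
           'k', 's', 'vn', 'nz', 'v', 'nrt', 'tg', 'nrfg', 't', 'ng', 'zg', 'a']

# Hypothetical helper: one built-in float per column, in COLUMNS order.
def to_ordered_values(pairs, columns=COLUMNS):
    # float() also converts a numpy.float64 into a built-in float,
    # which avoids the schema-inference error shown in the traceback.
    lookup = {name: float(value) for name, value in pairs}
    # Columns that this row does not mention become 0.0.
    return tuple(lookup.get(col, 0.0) for col in columns)

sample = [('n', 12.012457082117459), ('s', 0.79112758892014912)]
values = to_ordered_values(sample)
print(len(values), values[COLUMNS.index('n')], values[COLUMNS.index('s')])

# Assumed Spark usage (not run here):
#   from pyspark.sql import Row
#   row = Row(*COLUMNS)
#   df = tup_sum_data.map(lambda pairs: row(*to_ordered_values(pairs))).toDF()
```

Unpacking with `row(*values)` supplies one value per field, instead of handing the whole list to the first field.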