[Question title]: Pyspark: Convert an RDD of tuples to a DataFrame
[Posted]: 2017-11-02 23:22:51
[Question]:

I have an RDD of tuples whose first two rows look like this:

[[('n', 12.012457082117459), ('s', 0.79112758892014912)],
 [('t', 3.6243409329763652), ('vn', 3.6243409329763652), ('n', 52.743253562212828), ('v', 11.644347760553064)]]

In each tuple, the first value (e.g. 'n', 's', 't') is the desired column name, and the second value (e.g. 12.012, 0.7911, ...) is the desired value for that column. However, not every column name appears in every list (row) of the RDD. For example, in the first row only

'n', 's' 

appear, while

's' 

does not appear in the second row. I would therefore like to convert this RDD to a DataFrame in which any column that does not appear in a row's original tuples gets the value 0. In other words, the first two rows might look like this:

n     s      t       vn     omitted.....
12    0.79   0       0      ..... 
52    0      3.62    3.62    .......
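The fill-with-zero behavior described above can be sketched in plain Python: turn each row's list of tuples into a dict, then look every column up with a 0.0 default. The column list here is assumed for illustration; the accepted answer below hardcodes a similar set.

```python
raw = [[('n', 12.012457082117459), ('s', 0.79112758892014912)],
       [('t', 3.6243409329763652), ('vn', 3.6243409329763652),
        ('n', 52.743253562212828), ('v', 11.644347760553064)]]

cols = ['n', 's', 't', 'v', 'vn']  # assumed column list for illustration

# dict(row) maps each column name to its value; .get(c, 0.0) fills the gaps.
filled = [[dict(row).get(c, 0.0) for c in cols] for row in raw]
```

The same `dict(row).get(c, 0.0)` lookup is what a row-building function mapped over the RDD would do per row.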

I tried the following:

from pyspark.sql import Row

row = Row('l','eng','q','g','j','b','nt','z','n','d','f','i','k','s','vn','nz','v','nrt','tg','nrfg','t','ng','zg','a')
df = tup_sum_data.map(row).toDF()

The strings in Row() are the column names I want. But I get the following error:

TypeError                                 Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968         try:
--> 969             return _infer_schema(obj)
970         except TypeError:

/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_schema(row)
991     else:
--> 992         raise TypeError("Can not infer schema for type: %s" % type(row))
993 

TypeError: Can not infer schema for type: <class 'numpy.float64'>
During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968         try:
--> 969             return _infer_schema(obj)
970         except TypeError:

/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
969             return _infer_schema(obj)
970         except TypeError:
--> 971             raise TypeError("not supported type: %s" % type(obj))
972 
973 

TypeError: not supported type: <class 'tuple'>

Some lines of the traceback have been omitted. Can anyone help me figure out how to deal with this? Thanks!
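The root cause in the traceback is that Spark's schema inference does not recognize `numpy.float64` values. A minimal sketch of the cast that resolves it, using built-in floats to stand in for numpy values so it runs without numpy:

```python
raw = [[('n', 12.012457082117459), ('s', 0.79112758892014912)]]

# Cast every value with float(); the same call also converts numpy.float64,
# which is the type Spark's _infer_schema rejects above.
# Applied to the question's RDD this would be:
#   tup_sum_data.map(lambda row: [(k, float(v)) for k, v in row])
cleaned = [[(k, float(v)) for k, v in row] for row in raw]
```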

UPDATE I converted the data type from np.float64 to float and the error went away. However, the DataFrame does not look like what I want; it looks like this:

+--------------------+
|                   l|
+--------------------+
|[[n,12.0124570821...|
|[[t,3.62434093297...|
|[[a,0.44628710262...|
|[[n,16.7534769832...|
|[[n,17.6017774340...|
+--------------------+
only showing top 5 rows

So can anyone help me get a correctly formatted DataFrame? Thanks!

[Discussion]:

    Tags: python pyspark apache-spark-sql rdd pyspark-sql


    [Solution 1]:
    from pyspark.sql.types import *
    from pyspark.sql import *
    
    data_frame_schema = StructType([
        StructField("n", FloatType()),
        StructField("s", FloatType()),
        StructField("t", FloatType()),
        StructField("v", FloatType()),
        StructField("vn", FloatType())
    ])
    
    raw_list = [[('n', 12.012457082117459), ('s', 0.79112758892014912)], \
    [('t', 3.6243409329763652),('vn', 3.6243409329763652),('n', 52.743253562212828),('v', 11.644347760553064)]]
    
    raw_rdd = sc.parallelize(raw_list)
    
    # dict_to_row = lambda d: Row(n=d.get("n"), s=d.get("s"), t=d.get("t"), v=d.get("v"), vn=d.get("vn"))
    dict_to_row = lambda d: Row(n=d.get("n", 0.0), s=d.get("s", 0.0), t=d.get("t", 0.0), v=d.get("v", 0.0), vn=d.get("vn", 0.0))
    
    row_rdd = raw_rdd.map(lambda l: dict_to_row(dict(l)))
    df = spark.createDataFrame(row_rdd, data_frame_schema)
    df.show()
    

    Pasting the above into a pyspark shell produces this output:

    +---------+----------+--------+---------+--------+
    |        n|         s|       t|        v|      vn|
    +---------+----------+--------+---------+--------+
    |12.012457|0.79112756|     0.0|      0.0|     0.0|
    | 52.74325|       0.0|3.624341|11.644348|3.624341|
    +---------+----------+--------+---------+--------+
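If the set of column names is not known in advance, a possible alternative sketch (not from the answer) discovers the columns in a first pass and then fills each row, so the `StructType` and the `Row` fields need not be hardcoded. The Spark calls are shown in comments, using the `raw_rdd` and `spark` names from the answer; the executable part is plain Python:

```python
rows = [[('n', 12.012457082117459), ('s', 0.79112758892014912)],
        [('t', 3.6243409329763652), ('vn', 3.6243409329763652),
         ('n', 52.743253562212828), ('v', 11.644347760553064)]]

# Pass 1: discover every column name that appears anywhere.
# With an RDD: cols = sorted(raw_rdd.flatMap(lambda r: [k for k, _ in r]).distinct().collect())
cols = sorted({k for row in rows for k, _ in row})

# Pass 2: one dict per row, defaulting missing columns to 0.0.
# spark.createDataFrame(raw_rdd.map(lambda r: {c: dict(r).get(c, 0.0) for c in cols}))
# would then build the DataFrame without a hardcoded schema.
filled = [{c: dict(row).get(c, 0.0) for c in cols} for row in rows]
```

The trade-off is an extra pass over the data to collect the distinct keys before the DataFrame can be built.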
    

    [Comments]:

    • That did the trick! Thank you so much. By the way, is there any way to turn null into 0?
    • Just updated the answer to use 0.0 for missing values. Does that work?
    • Works great! Thanks