【Question Title】: pyspark equivalent of adding a constant array to a dataframe as column
【Posted】: 2020-04-19 06:34:12
【Question】:

The following code works in Scala Spark:

scala> val ar = Array("oracle","java")
ar: Array[String] = Array(oracle, java)

scala> df.withColumn("tags",lit(ar)).show(false)
+------+---+----------+----------+--------------+
|name  |age|role      |experience|tags          |
+------+---+----------+----------+--------------+
|John  |25 |Developer |2.56      |[oracle, java]|
|Scott |30 |Tester    |5.2       |[oracle, java]|
|Jim   |28 |DBA       |3.0       |[oracle, java]|
|Mike  |35 |Consultant|10.0      |[oracle, java]|
|Daniel|26 |Developer |3.2       |[oracle, java]|
|Paul  |29 |Tester    |3.6       |[oracle, java]|
|Peter |30 |Developer |6.5       |[oracle, java]|
+------+---+----------+----------+--------------+


scala>

How can I get the same behavior in PySpark? I tried the following, but it does not work and throws a Java error:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import lit

>>> tag=["oracle","java"]
>>> df2.withColumn("tags",lit(tag)).show()

Error:

: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [oracle, java]

【Question Comments】:

Tags: apache-spark pyspark


【Solution 1】:

The difference lies in how ar is declared in Scala versus how tag is declared in Python: ar has type Array[String], while tag is a Python list, and lit does not accept a list, which is why it throws that error.

You need to install numpy to declare an array, as shown below:

import numpy as np
tag = np.array(("oracle","java"))

FYI, using a List in Scala raises the same error:

scala> val ar = List("oracle","java")
ar: List[String] = List(oracle, java)

scala> df.withColumn("newcol", lit(ar)).printSchema
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(oracle, java)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
  at org.apache.spark.sql.functions$.lit(functions.scala:110)

【Comments】:

  • No, it gives an error. tag=["oracle","java"]; tag2=np.array(tag) works, but df.withColumn("tag",lit(tag2)) throws the error again.
  • Why are you using tag2=np.array(tag)? You should use tag = np.array(("oracle","java")), as I mentioned.
【Solution 2】:

I found that the following list comprehension works:

>>> from pyspark.sql.functions import array, lit
>>> arr=["oracle","java"]
>>> mp=[lit(x) for x in arr]
>>> df.withColumn("mk",array(mp)).show()
+------+---+----------+----------+--------------+
|  name|age|      role|experience|            mk|
+------+---+----------+----------+--------------+
|  John| 25| Developer|      2.56|[oracle, java]|
| Scott| 30|    Tester|       5.2|[oracle, java]|
|   Jim| 28|       DBA|       3.0|[oracle, java]|
|  Mike| 35|Consultant|      10.0|[oracle, java]|
|Daniel| 26| Developer|       3.2|[oracle, java]|
|  Paul| 29|    Tester|       3.6|[oracle, java]|
| Peter| 30| Developer|       6.5|[oracle, java]|
+------+---+----------+----------+--------------+

>>>

【Comments】:

【Solution 3】:

You can import array from the functions module:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import array, lit

>>> tag=array(lit("oracle"),lit("java"))
>>> df2.withColumn("tags",tag).show()
    

Tested below:

>>> from pyspark.sql.functions import array, lit

>>> tag=array(lit("oracle"),lit("java"))
>>> 
>>> ranked.withColumn("tag",tag).show()
+------+--------------+----------+-----+----+----+--------------+
|gender|    ethinicity|first_name|count|rank|year|           tag|
+------+--------------+----------+-----+----+----+--------------+
|  MALE|      HISPANIC|    JAYDEN|  364|   1|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|      HISPANIC|     ANGEL|  236|  18|2012|[oracle, java]|
|  MALE|      HISPANIC|     AIDEN|  235|  19|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    DANIEL|  232|  20|2012|[oracle, java]|
+------+--------------+----------+-----+----+----+--------------+
only showing top 20 rows
    

【Comments】:

• Just curious: how would you assign a Python dictionary as a column in the same way?
• Try the create_map() function.
• tag2= lit("{'a':1,'b':2}") rank.withColumn("tag",tag2).show()